This paper assesses gender disparities in federal criminal cases. It finds large 
gender gaps favoring women throughout the sentence length distribution (averaging over 
60%), conditional on arrest offense, criminal history, and other pre-charge observables. 
Female arrestees are also significantly likelier to avoid charges and convictions entirely, 
and twice as likely to avoid incarceration if convicted. Prior studies have reported much 
smaller sentence gaps because they have ignored the role of charging, plea-bargaining, 
and sentencing fact-finding in producing sentences. Most studies control for endogenous 
severity measures that result from these earlier discretionary processes and use samples 
that have been winnowed by them. I avoid these problems by using a linked dataset 
tracing cases from arrest through sentencing. Using decomposition methods, I show that 
most sentence disparity arises from decisions at the earlier stages, and use the rich data 
to investigate causal theories for these gender gaps. 

Estimating Gender Disparities in Federal Criminal Cases 

Introduction 

In the United States, men are fifteen times as likely to be incarcerated as women are. 
But can this gap be explained by differences in criminal behavior or circumstances, or are 
courts or prosecutors treating genuinely equivalent cases differently on the basis of gender? 
The latter would violate the Constitution, undercut the criminal justice system's punishment 
objectives, and contribute to the social consequences of demographically concentrated mass 
incarceration. So the reasons for the gender gap are of considerable legal and policy interest. 
This study explores them using a dataset that traces federal criminal cases from arrest 
through sentencing. I find that gender gaps widen at every stage of the justice process and 
that men and women ultimately receive dramatically different sentences. 

Existing studies of demographic disparities in criminal justice focus on narrow slices 
of the justice process in isolation. Most assess the judge's final sentencing decision, 
controlling for conviction severity or "presumptive sentence" measures that are themselves 
produced by discretionary decisions and negotiations. Ignoring disparities in those earlier 
stages could bias sentencing disparity estimates, both because the key control variable is 
endogenous and because of sample selection from the winnowing of cases at each procedural 
stage. Current sentencing literature typically ignores this "funnel." There is a small literature 
addressing disparities in prosecutorial decisions, but it addresses only certain pieces of the 
process and does not estimate their ultimate sentencing consequences. 

These limitations represent a surprising gulf between the quantitative empirical 
scholarship and the theoretical literature on the criminal justice system, which widely 
recognizes that sentencing is heavily shaped by prosecutors' capacious charging and 
bargaining discretion. This study seeks to close this gap, using a multi-agency linked dataset 
that traces cases from arrest through sentencing. I estimate sentence outcomes conditioned 
on characteristics that are fixed near the beginning of the justice process, rather than near the 
end of it: the arrest offense, criminal history, and other prior characteristics. This approach 
generates a measure of the aggregate gender disparity introduced in the post-arrest justice 
process. I then use sequential decomposition methods to assess how much of this gap 
appears to be explainable by decision-making at each procedural stage. See Altonji, 
Bharadwaj, and Lange 2008; DiNardo, Fortin, and Lemieux 1996. 

In short, I ask: do otherwise-similar men and women who are arrested for the same 
crimes end up with the same punishments, and if not, at what points do their fates diverge? 
Although the arrest offense is not a perfect proxy for underlying criminal conduct, it is a big 
improvement on the highly endogenous controls used in current research. I also use 
estimation strategies (reweighting of the mean and the distribution) that offer a useful 
solution to a problem with which sentencing researchers have long struggled: how to treat 
non-prison sentences. The leading approach is a Two-Part Model that separates the 
incarceration decision from the length decision, but that introduces serious sample selection 
concerns if there is disparity in the first stage. The best solution is simply to treat sentencing 
as a single process and estimate disparities in all sentences, including the zeros. Doing so 
with reweighting rather than regression obviates the functional form concerns that underlie 
many researchers' preference for the Two-Part Model. 

The estimated gender disparities are strikingly large, conditional on observables. 
Most notably, treatment as male is associated with a 63% average increase in sentence 
length, with substantial unexplained gaps throughout the sentence distribution. These gaps 
are much larger than those estimated by previous research. This is because, as the sequential 
decomposition demonstrates, the gender gap in sentences is mostly driven by decisions 
earlier in the justice process — most importantly sentencing fact-finding, a prosecutor-driven 
process that other literature has ignored. 

But why do these disparities exist? Despite the rich set of covariates, unobservable 
gender differences are still possible, so I cannot definitively answer the causal question. 
However, several plausible theories have testable implications, and I take advantage of the 
unusually rich dataset to explore them. I find substantial support for some theories 
(particularly accommodation of childcare responsibilities and perceived role differences in 
group crimes), but these appear to explain the observed disparities only partially. 

1. Discretion and Gender Disparity in Criminal Justice 

1.1. Sources of Discretion in the Federal Criminal Justice Process 

Just as the states do, the federal justice system gives enormous power to prosecutors. 
The United States in effect has a system of negotiated justice, and prosecutors hold most of 
the chips. They have broad discretion to choose charges from numerous overlapping 
criminal statutes, and then to determine the terms of plea deals. Plea-bargaining does not 
necessarily focus mainly on dropping of charges — indeed, the lead charge was dropped only 
17% of the time in this study's sample. The parties also often negotiate stipulations to key 
"sentencing facts" — for instance, the quantity of drugs trafficked or the defendant's major or 
minor role in a conspiracy. The prosecutor also may make non-binding sentencing 
recommendations or request special leniency to reward cooperators. 

Federal sentencing is guided by two main legal frameworks. First, each criminal 
statute specifies a sentencing range. Most are broad and start at zero (for instance, 0-20 
years), but some specify a "mandatory minimum." Second, since 1987, the statutory 
sentencing constraints have been supplemented by much narrower ranges (for instance, 27 to 
33 months) found in the U.S. Sentencing Guidelines. The Guidelines sought to reduce 
unwarranted disparities in sentencing, including gender disparities (see Breyer 1988), by 
constraining judicial discretion. They were mandatory until 2005, when the Supreme Court's 
decision in United States v. Booker (543 U.S. 220) rendered them advisory. But advisory 
does not mean unimportant — judges are still required to calculate the Guidelines sentence, 
and most sentences are still within the Guidelines range (U.S. Sentencing Commission 2010). 

The Guidelines sentencing ranges are found in the cells of a grid, the two axes of 
which are the "offense level" and the defendant's criminal history. Judges determine the 
offense level based on the crime(s) of conviction and the "sentencing facts." Although 
judges have independent factfinding authority, in practice they usually defer to the plea 
agreement's stipulations (Stith 2008; Schulhofer and Nagel 1997; Powell and Cimino 1995). 
One survey found that 92% of judges said their findings of fact diverge from the plea 
agreement either "infrequently" or "never" (Gilbert and Johnson 1996). 

Legal scholars widely agree that the Guidelines greatly empowered prosecutors 
because the sentence was now far more constrained by the charges of conviction and 
especially by the negotiated "sentencing facts" (Stith 2008; Bibas 2009). Prosecutors thus 
could both threaten long sentences and virtually promise much lower ones in exchange for 
guilty pleas, and plea rates rose from 87% to 97%, where they remain today (Alschuler 2005; 
Miller 2004). Although Booker expanded judicial discretion, the continued high rate of 
Guidelines compliance means these sources of prosecutorial influence have not disappeared. 
In addition, prosecutors can still firmly bind judges using mandatory minimums. 

Prosecutors have a variety of incentives to balance, including career incentives that 
push toward maximizing sentences and resource constraints that discourage going to trial 
(see, for example, Baker and Mezzetti 2001; Easterbrook 1983). In addition, prosecutors 
may be affected by sympathy or a sense of fairness. Schulhofer and Nagel (1997) review 
federal prosecutors' case files and find evidence of deliberate charge manipulation to avoid 
excessive sentences. Prosecutorial discretion is often described as the power not to seek to 
maximize punishment — to be selectively lenient (see Stith 2008). Although there may be 
good policy reasons for allowing such discretion, it is a potential source of unwarranted 
disparity if it is influenced by legally irrelevant factors such as gender. 

1.2. Existing Empirical Research 

Existing studies of demographic disparities in criminal justice have typically focused 
on single stages of the criminal process in isolation, usually the judge's final sentencing 
decision. In the federal-court literature, the usual approach is to estimate gaps in sentence 
outcomes when controlling for the Guidelines offense level and the defendant's criminal 
history. These two key controls are often combined into a "presumptive sentence," usually 
the lower end of the Guidelines range (U.S. Sentencing Commission 2010), or into dummies 
for the Guidelines grid cell (see, for example, Mustard 2001). Similarly, state-level studies 
generally control for some measure of conviction severity as well as criminal history (see, for 
example, Steffensmeier, Kramer, and Streifel 1993). 

Studies of gender disparity that take this approach have usually found that women 
receive shorter sentences, conditional on observables. The size of this effect has varied 
considerably, even among studies that use federal data. Sarnikar et al. (2007) find about a 
30% unexplained gender gap in sentence length, as did a prominent recent U.S. Sentencing 
Commission (2010) study. Many studies, however, have estimated considerably smaller 
disparities — for instance, Stacey and Spohn (2006), Schanzenbach (2005), and Mustard 
(2001) all find average gender gaps in sentence length of around 10%. 

The problem with the dominant approach is that the key control variable is itself the 
result of a host of discretionary decisions made earlier in the justice process, which these 
studies ignore. The resulting sentencing disparity estimates are potentially biased by the 
endogeneity of the key control variable as well as sample selection introduced by the 
dismissal of cases prior to sentencing. Although there have been occasional studies of plea- 
bargaining disparities (see, for example, Spohn and Spears 1997; Shermer and Johnson 
2010), they concern only certain bargaining outcomes, such as binary measures of whether any 
charges were dropped, and ignore negotiation over sentencing facts, which is the key aspect 
of bargaining in the modern federal system. Moreover, without assessing disparities in 
prosecutors' initial choice of charges, the charge-bargaining results are not very meaningful.[1] 



[1] Spohn, Gruhl, and Welch (1987) found gender disparities favoring women in the rate of filing felony charges 
in Los Angeles County, but did not analyze charge severity as an outcome. 
Further, the plea-bargaining studies tend to assess that stage in isolation too, rather than 
assessing its ultimate sentencing-disparity consequences. 

1.3. The Dataset 

This study uses data from four different federal sources: the U.S. Marshals' Service 
(USMS), the Executive Office of U.S. Attorneys (EOUSA), the Administrative Office of the 
U.S. Courts (AOUSC), and the U.S. Sentencing Commission (USSC); the Bureau of Justice 
Statistics provided inter-agency linking files that allow cases to be traced from arrest through 
sentencing. The main sample consists of federal property and fraud crimes, drug crimes, 
regulatory offenses, and violent crimes sentenced between FY 2001 and FY 2009.[2] 
Immigration cases, which have different stakes centering on deportation, were excluded. To 
reduce common support concerns, offense categories that were over 95% male were dropped: 
weapons, sex and pornography, conservation, and family offenses. 

The data include rich offense and offender information, including arrest offense 
(which USMS identifies with 430 codes),[3] gender, race, age, marital status, district, 
citizenship, a string field describing the offense, criminal history, number of dependents, 
education, Hispanic ethnicity, counsel type, co-defendant information, and county. AOUSC 
also lists the initial and final charges; these statutory sections then had to be coded on a 
numeric charge severity scale. I constructed three such scales based on combined severity of 
all charges: the statutory maximum, the statutory minimum, and a Guidelines-based measure. 
If the statute prescribed varying sentences depending on case facts, I used default 
assumptions grounded in legal research. For further details, see the Data Appendix. 

2. Analysis and Results 

2.1. Filing and Conviction-Stage Disparity 

This study principally focuses on whether male and female arrestees ultimately 
receive the same sentences, but a threshold question is whether they are equally likely to be 
sentenced at all. Disparities in charging and conviction rates are important outcomes in their 
own right, and also are potential sources of sample selection bias in the sentencing analysis. 
To be included in the sentencing data, defendants must first face charges before a district 
court judge — a close proxy for felony charges because misdemeanors are usually handled by 
magistrates. Second, defendants must be convicted of a non-petty offense: a felony or a Class 
A misdemeanor. Accordingly, I begin by estimating the probability of these events. 

Columns 1 and 2 of Table 2 report the "male" odds ratios from logistic regressions.[4] 
Conditional on arrest offense, district, race, citizenship, and age (the variables observed for 
all arrested defendants), male arrestees face a modestly but significantly higher probability of 
a charge before a district judge: 92.2% for the average male and 90.7% for the equivalent 



[2] For the filing and conviction analyses, the sample consists of cases charged or disposed of during that period. 
[3] I grouped certain closely related codes and subdivided certain drug codes based on a separate drug-type field. 
There were 123 arrest offenses after this recoding, and the results are robust to use of the original codes. 
[4] Except where other clustering is noted, all standard errors are clustered on arrest offense and district 
(combined), due to concern that local crime patterns or the U.S. Attorney's Office's priorities might introduce 
correlations. Results are robust to clustering on arrest offense or district alone. 
female.[5] Conditional on the same variables plus multi-defendant case structure, male district 
court defendants are also significantly more likely to be convicted of a non-petty offense 
(93.2% versus 91.4%; Table 2, Col. 2).[6] Sample selection bias from filing and conviction is 
likely to bias the sentencing disparity estimates reported below downward, but only slightly, 
because these initial disparities affect relatively few cases. I therefore do not correct the 
sentencing-stage estimates below for sample selection at these threshold stages. 

2.2. The "Two-Part Model" of Incarceration Probability and Sentence Length 

When estimating sentencing disparity, a threshold question is how to treat non-prison 
sentences such as probation or fines (18% of the sample). This question has been hotly 
debated in sentencing research. The leading practice is to break sentencing into two decision 
processes, each estimated parametrically: whether to order incarceration and, if so, for how 
long (see, for example, Berk 1983). The theory is that non-prison sentences have no 
obvious "prison equivalent," and moreover, some covariates might be more influential in the 
incarceration decision than the length decision or vice versa. A practical advantage is that 
constraining the length sample to positive-length cases allows log transformation without 
having to assign some arbitrary small value to the zeros.[7] This is ideal because sentencing 
law is structured so that inputs to sentencing will generally have multiplicative effects — each 
Guidelines grid cell is a multiplier of the ones adjacent to it. 
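
The sensitivity to the stand-in value for zeros can be seen in a small calculation. The Python sketch below is illustrative only: the sentences are made up, the helper `mean_log_gap` is hypothetical, and the conversion of half a day to months assumes a 30-day month. It shows that the same data give very different log-scale gaps depending on which small value replaces the zeros.

```python
import math

def mean_log_gap(male_months, female_months, zero_value):
    """Male-minus-female difference in mean log sentence, after replacing
    zero (non-prison) sentences with an assumed small positive value."""
    def mean_log(values):
        logs = [math.log(v if v > 0 else zero_value) for v in values]
        return sum(logs) / len(logs)
    return mean_log(male_months) - mean_log(female_months)

# Made-up sentences in months; 0 denotes a non-prison sentence.
males = [0, 24, 48]
females = [0, 0, 24]

# Zeros recoded as one month versus half a day (1/60 of a 30-day month).
gap_one_month = mean_log_gap(males, females, zero_value=1.0)
gap_half_day = mean_log_gap(males, females, zero_value=1.0 / 60.0)
```

With these invented numbers the log gap roughly doubles when the stand-in value moves from one month to half a day, although nothing about the underlying sentences has changed.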

Although I prefer a different approach (discussed below), for comparability to the 
current literature, I begin with estimates for this "Two-Part Model" (TPM). Table 2, Column 
3 shows the results of a logistic regression of an incarceration indicator on gender, arrest 
offense, criminal history, district, race, age, education level, U.S. citizenship, and the multi- 
defendant case flag. The average male in the sample faces an 86% probability of 
incarceration; comparable females are nearly twice as likely to avoid incarceration (74%). 
Conditional on incarceration, men receive sentences that are approximately 34% longer. 
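
The "nearly twice as likely" comparison follows from the two reported incarceration probabilities. The Python sketch below is a back-of-envelope check, not part of the estimation: the function names are hypothetical, and the implied odds ratio is computed from the two rounded percentages in the text rather than read from Table 2.

```python
def avoidance_ratio(p_incarc_male, p_incarc_female):
    """How many times likelier women are than men to avoid incarceration."""
    return (1.0 - p_incarc_female) / (1.0 - p_incarc_male)

def implied_odds_ratio(p_male, p_female):
    """Male-to-female odds ratio implied by two probabilities."""
    return (p_male / (1.0 - p_male)) / (p_female / (1.0 - p_female))

# Rounded probabilities reported in the text: 86% for men, 74% for women.
ratio = avoidance_ratio(0.86, 0.74)    # 0.26 / 0.14
odds = implied_odds_ratio(0.86, 0.74)  # (0.86 / 0.14) / (0.74 / 0.26)
```

Here `ratio` is a little under 1.9, which is the sense in which comparable women are nearly twice as likely to avoid incarceration.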

The complication is that the gender disparity in the incarceration decision almost 
surely means that the length estimates are downward biased by sample selection.[8] 
Criminologists have often responded to this problem with Heckman-style corrections (see 
Heckman et al. 1988; see Ulmer and Bradley [2006] for sentencing examples), but this 
approach is not ideal because there is no plausible exclusion restriction.[9] In addition, the 
approach assumes that the estimand is the average treatment effect (the "ATE") on the 
underlying population. In this context, that is a strange object: the gender disparity in prison 
sentence length that would be observed in a hypothetical world in which all defendants had 
to go to prison. This thought exercise is of improbable interest to policymakers. 

[5] This sample consisted of arrestees facing some charge. Cases that were entirely declined were dropped 
because they often represent unknown outcomes (transfers to other authorities or districts). When declinations 
citing a favorable reason (such as lack of evidence) are included as zeros, the gender disparity stays significant. 
[6] Petty offense convictions and jury acquittals are rare, so this disparity is driven by dismissals by prosecutors. 
[7] The resulting estimates would be extremely sensitive to the choice of small value. Note that there are also a 
very small number of life sentences, which I code as 540 months based on life expectancy data. 
[8] The direction of bias is clear because the incarceration decision and the prison length decision are both 
driven by observable and unobservable factors affecting case severity. If selection-on-observables holds in the 
full sample, it almost surely will not hold in the sample of nonzero prison cases, because the incarceration 
regression indicates that conditional on the observed covariates, men are more likely to be incarcerated — that is, 
it takes less severe unobservables to push a given male case into the incarceration sample. 
[9] As Bushway, Johnson, and Slocum (2007) point out, the sentencing literature tends to ignore this problem. 

If one is to follow the Two-Part Model at all, it is better instead to ask: If we went 
from treating everyone like women to treating everyone like men, 

(1) what percentage of non-prison sentences would be replaced with prison, and 

(2) among cases that already would have received prison sentences, how would the 
average length of those sentences change? 

More formally, the quantities of interest are: 

(1) $E(P^M \mid X) - E(P^F \mid X)$ 

(2) $E(Y^M \mid X, P^F = 1) - E(Y^F \mid X, P^F = 1)$ 

where P indicates a prison sentence, Y is prison sentence length, M and F denote the male 
and female treatment conditions, and X is the covariate distribution for the population 
noted.[10] Object (2), in my view, is of more policy interest than the full-population ATE, 
requiring no speculation about a world in which probation and fines were not possible. 

With the estimand framed this way, the selection bias problem is not that the 
estimation sample contains too few females, but that it contains "extra" males who would not 
have been incarcerated if they were female. If it were possible to identify who those extra 
males were, OLS regression in a sample excluding them would be an unbiased estimator of 
object (2). Unfortunately, while the number of extra males can be readily estimated based on 
the incarceration logit,[11] they cannot be identified; $P^F$ is unobserved for males (see Lee 
[2009], who discusses an analogous problem). In Table 3, I apply varying assumptions as to 
which males were marginal to produce different trimmed-sample estimates. 
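
One way the number of "extra" males could be computed from the incarceration logit is sketched below in Python. This is an assumption about the mechanics, which the text does not spell out: under monotonicity, each male's chance of being an extra male is his predicted incarceration probability as a male minus his predicted probability had he been treated as female. The function name and the four predicted probabilities are hypothetical.

```python
def expected_extra_males(p_if_male, p_if_female):
    """Expected count of males incarcerated only because they are male.

    Each argument holds one predicted incarceration probability per male
    defendant: as actually treated, and under female treatment.
    """
    extra = 0.0
    for pm, pf in zip(p_if_male, p_if_female):
        if pf > pm:
            raise ValueError("monotonicity violated: P^F exceeds P^M")
        extra += pm - pf
    return extra

# Hypothetical fitted probabilities for four male defendants.
p_if_male = [0.90, 0.80, 0.95, 0.70]
p_if_female = [0.80, 0.65, 0.90, 0.50]

n_extra = expected_extra_males(p_if_male, p_if_female)
# Share of the expected incarcerated-male sample that is "extra".
share_extra = n_extra / sum(p_if_male)
```

The count says how many males to trim from the length sample. It does not say which ones, and that is the identification problem described above.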

Table 3, Column 1 replicates the "male" coefficient on log prison sentence length 
from the full-sample OLS regression. Because sample selection bias is almost surely 
downward, this should be treated as a lower bound on the true sentence length disparity 
within the pool of cases that would have been subject to incarceration regardless of gender. 
Column 2 provides something roughly approximating an upper bound, based on a 
near-worst-case assumption about selection bias. The Column 2 sample has trimmed the males 
with the lowest (most negative) individual influence on the "male" coefficient.[12] In this case, 
the Column 2 length-disparity estimate is about 67%, approximately double the estimate for 
the untrimmed sample. Columns 3 and 4 of Table 3 show results for samples trimmed based 



[10] This notation assumes monotonicity, such that $P^M = 1$ whenever $P^F = 1$. 
[11] This assumes gender monotonically affects incarceration probability, a reasonable assumption: being male 
greatly increased that probability in every one out of dozens of analyzed subsamples. 

[12] Lee (2009) proposes a similar trimming method for estimating bounds on the effect of a randomly assigned 
treatment when treatment monotonically affects attrition. In that case worst-case bounds can be more readily 
estimated; the trim that will raise the treatment effect estimate by the most is just the lower tail of the treated 
outcome distribution (see Lee 2009). The trim I conduct in Table 3, Column 2 is based on the same intuition. 
But rather than assuming random treatment, I assume selection on observables within the full sentenced sample, 
and use regression to estimate the number of "extra males" and to model the outcome. This assumption could 
certainly be challenged, as I discuss below, but it already underlies both parts of the TPM; my method simply 
gives a near-worst-case adjustment for the second-part estimate assuming that the first part is correct. 

When there are covariates, one cannot just trim the lower tail; rather, the trim is based on the 
observations' influence on the partial effect of being male. Estimating a true upper bound would require 
trimming the group with the most negative joint influence on the "male" coefficient. Identifying that group is 
computationally impossible. But ranking observations by individual influence is easy and is, in practice, 
probably a "bad enough" assumption about sample selection to provide useful guidance as to its possible scope. 
on a plausibly realistic (rather than worst-case) assumption about who the marginal males 
are. The assumption is simply that they are those with short sentences — that is, that gender is 
likelier to be the deciding factor in closer cases. The Column 3 sample trims the males with 
the very shortest nonzero sentences (one year or less), while the Column 4 sample picks them 
randomly from the bottom quarter of the distribution (two years or less). The estimates for 
these two trimmed samples are 63% and 47%, respectively. 

This trimming exercise is not meant to "correct" sample selection bias, but rather to 
provide a general sense of its possible magnitude. Unfortunately, the potential bias here is 
large, rendering the TPM not ideally informative. The TPM remains appealing when the 
disparity in incarceration probability is small, such that selection bias is likely minor; for this 
reason, Rehavi and Starr (2012a) used it to assess racial disparity. In the gender context, 
however, more useful guidance can be found using other methods. 
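
The logic of trimming the shortest male sentences can be illustrated with a deliberately simplified Python sketch. It is not the Table 3 procedure: there are no covariates, the gap is a raw difference in mean log sentences rather than a regression coefficient, and the sentences, the number trimmed, and the function names are all made up.

```python
import math

def mean_log_gap(male_months, female_months):
    """Male-minus-female difference in mean log prison sentence."""
    mean_m = sum(math.log(m) for m in male_months) / len(male_months)
    mean_f = sum(math.log(f) for f in female_months) / len(female_months)
    return mean_m - mean_f

def trim_shortest(male_months, n_extra):
    """Drop the n_extra shortest male sentences (the assumed marginal males)."""
    return sorted(male_months)[n_extra:]

# Made-up positive prison sentences in months.
males = [6, 12, 24, 48, 96]
females = [6, 12, 24]

full_gap = mean_log_gap(males, females)
trimmed_gap = mean_log_gap(trim_shortest(males, 1), females)

# Convert log-point gaps to percentage differences.
full_pct = (math.exp(full_gap) - 1.0) * 100.0
trimmed_pct = (math.exp(trimmed_gap) - 1.0) * 100.0
```

Dropping the shortest male sentence raises the estimated gap. This is the direction of adjustment expected when the untrimmed estimate is biased downward by the presence of marginal males.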

2.3. Inverse Propensity-Score Weighting Estimates of Gender Disparities 

The sample selection problem described above would not exist but for the choice to 
model the determination of sentences as two distinct decision processes, a choice that is not 
compelled by theory.[13] I propose a simpler approach: keeping non-prison sentences in the 
sample for the length-disparity estimates, and treating them as zeros. 

While the Two-Part Model dominates the sentencing literature, a substantial minority 
of the literature rejects it. Researchers following the minority approach typically instead treat 
sentencing as a single process in which the non-prison cases are censored, applying a Tobit 
model that estimates average disparity in an underlying latent variable (see Tobin 1958; see 
Sarnikar et al. [2007]; Bushway and Piehl [2001]; Kurlychek and Johnson [2004]; and 
Albonetti [1997] for sentencing examples). This approach avoids the sample selection 
concern, but raises other practical problems. The Tobit is not robust to violations of its 
assumptions of normality and homoskedasticity (see, for example, Arabmazar and Schmidt 
1982; Cameron and Trivedi 2010) — and in this sample, specification tests for the Tobit are 
decisively failed. Moreover, while the Tobit allows researchers to avoid assigning a specific 
value to the non-prison sentences, they still must choose a censoring point below which their 
value is assumed to fall. This choice is arguably equally arbitrary, and if the length variable 
is log-transformed, it will have a big effect on the Tobit estimates.[14] 

The approach I propose is conceptually simpler than either the Tobit approach or the 
Two-Part Model, and avoids the practical weaknesses of both. If incarceration disparities are 
the outcome of policy interest, then there is nothing unknown about the value of non-prison 
sentences: they are correctly valued at zero. The main practical drawback of including them 
is that it precludes log transformation, but this functional form concern is only a problem for 
parametric estimation. I instead estimate the average length disparity in months by inverse 
propensity score weighting ("IPW"), without specifying any functional relationship between 



[13] Bushway and Piehl (2001) provide strong reasons that a single-decision model (in particular, the Tobit) is a 
better fit to the Guidelines process, in which zeros are just values in the lower end of the sentencing grid. 

[14] For instance, using a lower limit of half a day in the Tobit log prison model (and the same covariates as in 
the TPM above) produces a gender disparity estimate of 128%, while a limit of one month produces an estimate 
of 72%. Either limit is theoretically defensible, as are many others. While the very lowest observed nonzero 
sentence is one day, only 0.3% are below one month. One might reasonably set the limit to censor these cases, 
to avoid giving excessive weight to large multiplicative differences between trivially short sentences. 
the covariates and the outcome variable. I then extend this method to the distribution, 
allowing assessment of disparities in incarceration probability as well as other possible 
heterogeneity in gender effects on sentences of different lengths. 

The IPW estimates of average gender disparities in sentence length are given in Table 
4. The probability of being male, $E(M \mid X_i)$, for each observation (the "propensity score") is 
first estimated by a logistic regression of "male" on the covariates X: arrest offense, 
criminal history, race, age, education level, U.S. citizenship, and the multi-defendant case 
indicator.[15] Estimates of average gender disparities are then produced via weighted 
regression where the weights are inverse functions of the propensity score. To refer to the 
estimands, I use the common language of "treatment effects," where "treatment" refers to 
being male. But note that for these "effects" to be given a causal interpretation, one must 
assume there are no confounding variables; I return to this point below. 

In Column 1 of Table 4, I estimate the overall average gender disparity in sentence 
length conditional on the pre-charge covariates. This "average treatment effect" (ATE) 
represents the difference between two counterfactuals: the mean sentence if everybody were 
treated like males and the mean sentence if everybody were treated like females (see 
DiNardo 2002).[16] Table 4, Columns 4 and 7 reflect separate estimates of the average effects 
of gender disparity on male and female sentences. The "average treatment effect on the 
treated" (TOT) reflects the estimated effect of being male on male sentences, and is 
estimated by comparing the observed male average to a reweighted female average (Col. 
4).[17] After this reweighting, the female endowments of covariates are similar to those of the 
males, so the reweighted female mean can be interpreted as a counterfactual mean if males 
were treated like females. The "average treatment effect on the untreated" (TUT) is 
conversely estimated by reweighting the males, and represents the counterfactual increase in 
sentence if females were treated like males (Col. 7). 
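
The three estimands can be written as differences in weighted means. The Python sketch below is a minimal illustration under stated assumptions: the propensity scores are taken as already estimated, the four observations are invented, the function names are hypothetical, and the TUT weights of (1 - p)/p for males are inferred by symmetry with the ATE and TOT weights described in the notes rather than stated in the text. Dividing by the sum of weights within each group stands in for the rescaling of weights.

```python
def weighted_mean(values, weights):
    """Weighted average, normalized by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def ipw_effects(sentence, male, pscore):
    """Return (ATE, TOT, TUT) in months, with 'treatment' meaning male."""
    m_y = [y for y, m in zip(sentence, male) if m]
    m_p = [p for p, m in zip(pscore, male) if m]
    f_y = [y for y, m in zip(sentence, male) if not m]
    f_p = [p for p, m in zip(pscore, male) if not m]

    # ATE: both groups reweighted to the full-sample covariate mix.
    ate = (weighted_mean(m_y, [1.0 / p for p in m_p])
           - weighted_mean(f_y, [1.0 / (1.0 - p) for p in f_p]))
    # TOT: observed male mean minus females reweighted to look like males.
    tot = (sum(m_y) / len(m_y)
           - weighted_mean(f_y, [p / (1.0 - p) for p in f_p]))
    # TUT: males reweighted to look like females minus observed female mean.
    tut = (weighted_mean(m_y, [(1.0 - p) / p for p in m_p])
           - sum(f_y) / len(f_y))
    return ate, tot, tut

# Invented data: sentence in months (zeros kept), male flag, propensity score.
sentence = [40.0, 20.0, 10.0, 0.0]
male = [1, 1, 0, 0]
pscore = [0.8, 0.5, 0.5, 0.2]

ate, tot, tut = ipw_effects(sentence, male, pscore)
```

Because non-prison sentences enter as zeros and the outcome is in months, no functional form links the covariates to the sentence. The covariates matter only through the propensity score.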

As Table 4 shows, even after reweighting, the average gender gaps in sentence length 
are strikingly large. The overall average disparity (the ATE) in Column 1 is 23 months, 
which translates into a 63% increase in sentence length. When measured in months, gender 
appears to have a bigger effect on males than females (compare Columns 4 and 7): being 
male increases male sentences by 25 months, and would increase female sentences by 15 
months. But this difference is mostly because of a higher baseline average: in percentage 
terms, the TOT and TUT are not very different (64% versus 61%). 
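
The role of the baseline can be checked with simple arithmetic. The Python sketch below assumes that each percentage is the month gap divided by the corresponding counterfactual mean under female treatment. The implied baselines are back-of-envelope figures derived from the rounded numbers in the text, not values reported in Table 4, and the function name is hypothetical.

```python
def implied_baseline(gap_months, pct_increase):
    """Counterfactual mean sentence implied by a gap and a percentage increase."""
    return gap_months / (pct_increase / 100.0)

# Rounded figures from the text: TOT of 25 months (64%), TUT of 15 months (61%).
tot_baseline = implied_baseline(25, 64)  # about 39 months
tut_baseline = implied_baseline(15, 61)  # about 25 months
```

On this reading, the larger month gap for men reflects a higher starting point (roughly 39 versus 25 months) rather than a proportionally larger gender effect.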

A drawback of propensity score reweighting is its vulnerability to the problem of 
limited overlap between the male and female samples (see Busso, DiNardo, and McCrary 
2008). Although the large sample size reduces this concern, women are only 19% of the 
sample and are thinly represented in certain offenses and high criminal history categories.[18] 
The reweighting of the female distribution risks giving unduly high weight to women with 
unusual covariate values. In Table 4, Columns 2 and 5, 1 report the ATE and TOT for a 



District fixed effects, which were included in the Two-Part Model, are not included in the weights. When 
reweighting, parsimony makes it easier to balance the most important variables, and gender composition does 
not vary much by district in any event. The results are robust to including the districts. 

The weights are given by $1/(1 - E(M \mid X_i))$ for female observations and $1/E(M \mid X_i)$ for males, before rescaling to 
average 1 (see Busso, DiNardo, and McCrary 2008). 

The weights are $E(M \mid X_i)/(1 - E(M \mid X_i))$ for female observations, before rescaling to average 1. 

See Figure 1a for the propensity score distribution. 






Starr — Estimating Gender Disparities in Federal Criminal Cases 



sample that eliminates those problematic covariate combinations by trimming extreme 
propensity score values (see, for example, Heckman et al. 1998). The drawback with this 
method is that the sample to which the estimates apply is not very intuitively or transparently 
defined. In Columns 3 and 6, I report the ATE and TOT for an alternate sample that 
excludes the highest three criminal history categories. Both trimming strategies produce 
gender disparity estimates that are fairly similar in percentage terms to the full-sample 
estimates (compare Columns 1 through 3 and Columns 4 through 6). 
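
The motivation for trimming can be sketched as follows. The records are invented, the 0.93 cutoff simply echoes the approximate value reported in the notes, and the helper names are hypothetical; the variance-minimizing choice of cutoff (Crump et al. 2009) is not implemented here.

```python
def tot_weights(records):
    """TOT weights p / (1 - p) for the female observations."""
    return [r["p"] / (1 - r["p"]) for r in records if not r["male"]]

def max_weight_share(records):
    """Largest share of the total female weight carried by a single woman."""
    w = tot_weights(records)
    return max(w) / sum(w)

def trim(records, cutoff=0.93):
    """Drop observations whose estimated propensity score exceeds the cutoff."""
    return [r for r in records if r["p"] <= cutoff]

# Invented sample: (male, propensity score, count).
groups = [(1, 0.50, 50), (0, 0.50, 50), (1, 0.80, 80), (0, 0.80, 20),
          (1, 0.98, 98), (0, 0.98, 2)]
full = [{"male": m, "p": p} for m, p, n in groups for _ in range(n)]
trimmed = trim(full)
```

Before trimming, each of the two women in the most male-dominated cell carries more than a fifth of the total female weight; after trimming, no woman carries more than about 3%. The cost is the one noted in the text: the trimmed sample is defined by a fitted score rather than by a transparent rule.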

I report only the full-sample results for the TUT (the effect of gender on women), 
because estimating it depends on reweighting only the males, and no males have propensity 
scores anywhere near zero. For this reason, as I proceed below to analyze the gender 
disparity in more detail, I focus on the counterfactual effects if women were treated like men. 
The effects of gender on men and women are of equal policy interest, but analyzing the TUT 
is simpler because the full sample can be used without limited-overlap concerns. 

Table 5 accordingly shows TUT estimates for subsamples and alternate 
specifications. Column 1 replicates the main estimate from Table 4 for comparison purposes. 
Columns 2 and 3 show estimates for two large offense-type categories: drug offenses 
(Column 2) and property, fraud, and regulatory offenses (Column 3). In percentage terms the 
effects are similar. The disparity is likewise almost identical in percentage terms before and 
after the watershed Booker decision (Columns 4 and 5). It is smallest for non-parents and 
largest for single parents (51.6% versus 67.3%; compare Columns 6-8). It is larger for 
defendants in multi-defendant cases than for sole defendants (66% vs. 51.2%, Columns 9-10), 
much larger among blacks than non-blacks (74% vs. 51.1%, Columns 11-12), and 
slightly larger in states without federal women's prisons (Columns 13-14). Many of these 
subsample comparisons are useful in assessing possible causal theories for the unexplained 
gender gap, and they will be further addressed in the Discussion. 

The remainder of Table 5 shows the robustness of the TUT estimates to alternate 
specifications of the gender-propensity model. Columns 15 and 16 show that the TUT is 
unchanged by the addition of a set of flags for case characteristics mentioned in a text field 
based on the arresting officers' notes (in 2001-2007, the years the field is available). The 
flags are for mentions of guns, other weapons, drug seizures, official victims, minor victims, 
conspiracy and racketeering. Columns 17 through 20 show that the estimates are robust to 
adding controls for marital and parental status and defense counsel type. Disparities decline 
slightly when controlling for pleas and time elapsed before conviction (Col. 21). The gender 
disparities in drug cases decline slightly when drug quantity seized at arrest, as recorded in 
the EOUSA investigation files, is added to the controls. This check could only be performed 
for arrests before 2004 because of data limitations (compare Columns 22 and 23). 



The propensity-score cutoff (approximately 0.93) is optimized to minimize variance (see Crump et al. 2009). 
The trim drops about 4% of women and 21% of men from the sample. 

The main sample already excludes the most male-dominated crime categories. Adding the criminal history 
constraint does not entirely eliminate the limited overlap problem, but mitigates it considerably (see Figure 1b). 

This does not preclude the possibility that Booker changed disparities; this analysis does not seek to 
disentangle Booker's causal effects from longer-term trends. 

Results are also robust to the use of the original ungrouped arrest codes; the addition of district controls, 
Hispanic ethnicity, and county-level controls for poverty rate, unemployment, per capita income, and crime 

Finally, a comparison of Column 1 and Column 24 of Table 5 illustrates the 
importance of the choice to condition on arrest offense rather than on the end result of 
sentencing fact-finding. The Column 24 reweighting substitutes the final Guidelines offense 
level instead of the arrest offense, and the estimated disparity is reduced by 63%. This 
comparison suggests that by conditioning on an endogenous variable and ignoring gender 
disparities introduced earlier in the justice process, the current literature may have 
substantially understated the size of the gender gap. 

In Figure 2, I extend the reweighting method to estimate the effect of gender on the 
distribution of sentences for females following the method proposed by DiNardo, Fortin, and 
Lemieux (1996). The white and black bars reflect the observed distribution of sentence 
lengths for male and female defendants, respectively; non-prison sentences have their own 
bin and need not be assigned a numeric value. The checkered bars represent the 
counterfactual distribution if females were treated like males. Comparison of the checkered 
to the black bars shows large unexplained gaps throughout the distribution. The unexplained 
gap in the share sentenced to non-prison sentences (about 11 percentage points) is similar to 
the regression estimate in Table 2. The gap is not confined to the low end — the whole 
reweighted male distribution is shifted to the right relative to the female distribution. 
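
A counterfactual distribution of this kind is a weighted histogram. The sketch below is illustrative only: the data and bin edges are invented, `None` is used as a hypothetical marker for a non-prison sentence so that it gets its own bin without being assigned a numeric value, and the propensity scores are treated as already estimated.

```python
def sentence_bin(months, edges=(12, 36, 60, 120)):
    """Assign a sentence to a bin; non-prison sentences get their own bin."""
    if months is None:
        return "non-prison"
    for edge in edges:
        if months <= edge:
            return f"up to {edge} months"
    return f"over {edges[-1]} months"

def weighted_shares(months, weights):
    """Share of total weight falling in each sentence bin."""
    total = sum(weights)
    shares = {}
    for m, w in zip(months, weights):
        b = sentence_bin(m)
        shares[b] = shares.get(b, 0.0) + w / total
    return shares

# Invented data: (male, propensity score, months or None, count).
groups = [(1, 0.8, None, 8), (1, 0.8, 30, 72), (0, 0.8, None, 8), (0, 0.8, 20, 12),
          (1, 0.5, None, 15), (1, 0.5, 10, 35), (0, 0.5, None, 30), (0, 0.5, 6, 20)]
rows = [(m, p, y) for m, p, y, n in groups for _ in range(n)]
men = [(p, y) for m, p, y in rows if m]
women = [(p, y) for m, p, y in rows if not m]

observed_female = weighted_shares([y for _, y in women], [1.0] * len(women))
# Counterfactual: men reweighted by (1 - p) / p to match the female covariates.
counterfactual = weighted_shares([y for _, y in men], [(1 - p) / p for p, _ in men])
gap = observed_female["non-prison"] - counterfactual["non-prison"]
```

Comparing `observed_female` with `counterfactual` bin by bin corresponds to comparing the black bars with the checkered bars; `gap` is the unexplained difference in the non-prison share for the toy data.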

2.4. Decomposing the Gender Gaps 

The estimates presented above represent the aggregate disparities introduced 
throughout the post-arrest justice process, raising the further question of when in the justice 
process those disparities emerge. Table 6 shows a sequential decomposition of the observed 
average gender disparity into components explainable by pre-charge covariates and by each 
subsequent stage of the process: charging, charge-bargaining, sentencing fact-finding, and 
sentencing. The method is a sequence of inverse-propensity score reweightings, in which 
new variables are added to the propensity score estimation at each step (see, for example, 
Altonji, Bharadwaj, and Lange 2008; DiNardo, Fortin, and Lemieux 1996). 

In this part of the analysis, data limitations require separate assessment of drug and 
non-drug cases. For non-drug crimes, the initial and final charges were coded with the 
statutory minimum, maximum, and Guidelines measures described above. But in drug cases, 
the AOUSC charge data are too ambiguous to permit that coding; the same statutory 
subsections encompass a vast array of drug types, quantities, and sentences. The only usable 
measure of statutory severity available for drug cases is the mandatory minimum for the 
crime of conviction, which the Sentencing Commission records. Thus, in drug cases I cannot 
disentangle the effects of initial charging and subsequent charge-bargaining. The mandatory 
minimum variable represents the combined effect of those stages. 

The non-drug decomposition is shown in Panel A of Table 6. Column 1 shows the 
raw observed gender gap to be decomposed. In Column 2, the men have been weighted 
based on pre-charge covariates. Columns 3, 4 and 5 sequentially add the initial charge 
severity measures, the conviction measures, and the final offense level (the product of 
sentencing fact-finding). The drug decomposition (Panel B) has one stage fewer: the 
conviction mandatory minimum substitutes for the separate charging and conviction 
variables. The explanatory value attributed to each stage is the change in the unexplained 



rate; and various exclusions from the sample: cases in which the indictment was issued before the arrest, cases 
from the South, and arrests by each of the two enforcement agencies (the FBI and the DEA). 

gender gap when one adds that stage's measures. What remains after the final reweighting is 
attributed to the sentencing decision. In the last two lines of each panel, I express each 
component as percentages of the raw observed gender gap and of the gender gap that was 
unexplained by the pre-charge covariates. That is, the last line decomposes the gender 
disparity that appears to be introduced during the criminal justice process. 
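
The sequence of reweightings can be sketched as below. This is a simplified illustration, not the procedure used for Table 6: the data are invented, only two steps are shown, and the propensity score at each step is the male share within the cell defined by the covariates added so far (so the weight (1 - p)/p for a man reduces to the ratio of women to men in his cell). The function and field names are hypothetical.

```python
from collections import Counter

def unexplained_gap(rows, keys):
    """Reweighted male mean minus female mean, conditioning on `keys`."""
    def cell(r):
        return tuple(r[k] for k in keys)
    n_m = Counter(cell(r) for r in rows if r["male"])
    n_f = Counter(cell(r) for r in rows if not r["male"])
    num = den = 0.0
    for r in rows:
        if r["male"]:
            w = n_f[cell(r)] / n_m[cell(r)]  # (1 - p) / p within the cell
            num += w * r["months"]
            den += w
    women = [r["months"] for r in rows if not r["male"]]
    return num / den - sum(women) / len(women)

def sequential_decomposition(rows, stages):
    """`stages` is an ordered list of (label, covariates added at that step)."""
    keys, prev = [], unexplained_gap(rows, [])
    parts = {"raw gap": prev}
    for label, added in stages:
        keys = keys + added
        gap = unexplained_gap(rows, keys)
        parts[label] = prev - gap  # change in the unexplained gap
        prev = gap
    parts["remainder"] = prev  # left for the stages not modeled here
    return parts

# Invented data: (male, arrest offense, initial charge, months, count).
groups = [(1, "a", "hi", 40, 60), (1, "a", "lo", 20, 20),
          (1, "b", "hi", 30, 20), (1, "b", "lo", 12, 30),
          (0, "a", "hi", 30, 5), (0, "a", "lo", 15, 15),
          (0, "b", "hi", 24, 10), (0, "b", "lo", 8, 40)]
rows = [{"male": m, "arrest": a, "charge": c, "months": y}
        for m, a, c, y, n in groups for _ in range(n)]

parts = sequential_decomposition(
    rows, [("pre-charge covariates", ["arrest"]), ("initial charging", ["charge"])])
```

By construction the components and the remainder sum to the raw gap. Further stages (conviction measures, final offense level) would be appended to the `stages` list in process order.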

This method of decomposition is path-dependent: explanatory value is preferentially 
attributed to the covariates that are added first. Path-dependence is often a drawback to 
sequential decomposition, because in many contexts, when multiple correlated covariates 
together explain a certain portion of an outcome gap, there is no theoretical reason to 
"blame" one over the others (see Fortin, Lemieux, and Firpo 2011; DiNardo, Fortin, and 
Lemieux 1996). But here path-dependence is desirable, because the justice process is itself 
path-dependent: earlier decisions constrain later ones. The decomposition tracks the 
divergence of men's and women's fates as the process advances, so it would not make sense 
to attribute to a later stage a disparity that already existed. When there is a natural ordering 
like this, sequential decomposition is appropriate (see Altonji, Bharadwaj, and Lange 2008). 
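
Path-dependence is easiest to see in an extreme case. The sketch below uses invented data in which two hypothetical covariates, `x1` and `x2`, are perfectly correlated; whichever is added first receives all of the explanatory credit. The cell-share propensity score and the function names are assumptions made for the illustration.

```python
from collections import Counter

def unexplained_gap(rows, keys):
    """Reweighted male mean minus female mean, conditioning on `keys`."""
    def cell(r):
        return tuple(r[k] for k in keys)
    n_m = Counter(cell(r) for r in rows if r["male"])
    n_f = Counter(cell(r) for r in rows if not r["male"])
    num = den = 0.0
    for r in rows:
        if r["male"]:
            w = n_f[cell(r)] / n_m[cell(r)]  # (1 - p) / p within the cell
            num += w * r["months"]
            den += w
    women = [r["months"] for r in rows if not r["male"]]
    return num / den - sum(women) / len(women)

def decompose(rows, order):
    """Add covariates one at a time in `order`; credit each with the change."""
    keys, prev, parts = [], unexplained_gap(rows, []), {}
    for name in order:
        keys = keys + [name]
        gap = unexplained_gap(rows, keys)
        parts[name] = prev - gap
        prev = gap
    parts["remainder"] = prev
    return parts

# Invented data: (male, severity level, months, count); x1 and x2 are identical.
groups = [(1, "hi", 40, 60), (1, "lo", 20, 40), (0, "hi", 30, 10), (0, "lo", 15, 40)]
rows = [{"male": m, "x1": s, "x2": s, "months": y}
        for m, s, y, n in groups for _ in range(n)]

order_a = decompose(rows, ["x1", "x2"])
order_b = decompose(rows, ["x2", "x1"])
```

Both orderings leave the same remainder, but `order_a` credits everything to `x1` and `order_b` credits everything to `x2`. When the covariates have a natural temporal order, as the stages of the justice process do, that asymmetry is the intended behavior.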

The decompositions show that significant new disparity favoring women is 
introduced at every stage of the justice process, but sentencing fact-finding is especially 
crucial. In non-drug cases, an eight-month gender gap remained unexplained after 
reweighting by arrest offense and the other pre-charge covariates — this is the gap attributed 
to the justice process as a whole. Initial charging and charge-bargaining contribute about 9% 
and 4% of the gap, respectively; Guidelines fact-finding explains 60%, leaving 27% for the 
final sentencing stage to explain. In drug cases, the mandatory minimum can explain one 
third of the 23-month gender gap attributed to the justice process. Guidelines fact-finding 
can explain 29.5%, leaving 37% attributed to the final sentencing decision. 
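
For scale, the percentage shares can be converted back into months. The conversion below uses only the rounded figures quoted in this paragraph, so the resulting months are approximate, and the drug-case components do not sum exactly to 23 because the quoted shares are rounded. The function name is hypothetical.

```python
def shares_to_months(gap_months, shares):
    """Convert percentage shares of an unexplained gap into months."""
    return {stage: gap_months * pct / 100 for stage, pct in shares.items()}

# Non-drug cases: 8-month gap attributed to the justice process.
non_drug = shares_to_months(8, {"initial charging": 9, "charge-bargaining": 4,
                                "fact-finding": 60, "sentencing": 27})

# Drug cases: 23-month gap; "one third" is entered as 100 / 3 percent.
drug = shares_to_months(23, {"mandatory minimum": 100 / 3, "fact-finding": 29.5,
                             "sentencing": 37})
```

On these figures, fact-finding accounts for a little under five months of the non-drug gap, while each of the three drug-case components accounts for roughly seven to eight and a half months.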

In Figures 3a through 3d, I show a similar sequential decomposition of the sentencing 
distributions (see DiNardo, Fortin, and Lemieux 1996). Figure 3a shows the distribution of 
non-drug sentences observed for males and females and, between them, the distributions 
produced by the same series of reweightings described above. Each step in the sequence 
makes the male distribution look somewhat more like the female. Figure 3b presents these 
results in a way that (while it does not show the underlying distributions) allows the 
procedural sources of the gaps in the distribution to be more readily discerned. The full 
height of each bar represents the gap in the cumulative distribution at the denoted sentence 
threshold after reweighting by the pre-charge covariates — that is, the gap in the probability of 
getting a sentence exceeding the threshold. The patterned sections decompose these gaps 
into charging, charge-bargaining, fact-finding, and sentencing components. Figures 3c and 
3d repeat these exercises for drug cases. The decompositions again show the central role of 
sentencing fact-finding, especially in explaining gaps higher in the length distribution. 
Judges' final sentencing decisions appear to be more important in explaining disparities at 
the lower end, particularly in the incarceration decision (Figs. 3b, 3d). 

Because fact-finding and Guidelines departures are both stages in which men's and 
women's outcomes appear to diverge substantially, it is worth inquiring whether any 
particular findings of fact and departures appear to be key factors. Table 7 shows the 



For instance, the initial charges define the range of possible outcomes to charge-bargaining; charges are 
almost never added (and in most cases are not dropped). 

explanatory value attributed to each of several findings and departures when they are added 
to the mean decompositions from Table 6. These variables were not added sequentially with 
one another because there is no natural ordering among them; each was added independently. 
If they are correlated, the sum of the shares reported likely overstates their collective 
importance. Each share is thus best interpreted as the maximum the variable can explain. 
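
The overstatement is easy to reproduce in a toy example. In the sketch below, two hypothetical findings, `finding_a` and `finding_b`, are perfectly correlated in invented data; each explains the same share when added on its own, so the sum of the two shares is double what they explain jointly. The cell-share propensity score and the baseline with no other covariates are simplifying assumptions.

```python
from collections import Counter

def unexplained_gap(rows, keys):
    """Reweighted male mean minus female mean, conditioning on `keys`."""
    def cell(r):
        return tuple(r[k] for k in keys)
    n_m = Counter(cell(r) for r in rows if r["male"])
    n_f = Counter(cell(r) for r in rows if not r["male"])
    num = den = 0.0
    for r in rows:
        if r["male"]:
            w = n_f[cell(r)] / n_m[cell(r)]  # (1 - p) / p within the cell
            num += w * r["months"]
            den += w
    women = [r["months"] for r in rows if not r["male"]]
    return num / den - sum(women) / len(women)

# Invented data: (male, finding value, months, count); both findings coincide.
groups = [(1, "hi", 40, 60), (1, "lo", 20, 40), (0, "hi", 30, 10), (0, "lo", 15, 40)]
rows = [{"male": m, "finding_a": s, "finding_b": s, "months": y}
        for m, s, y, n in groups for _ in range(n)]

baseline = unexplained_gap(rows, [])
# Each finding added independently to the baseline specification.
alone = {k: baseline - unexplained_gap(rows, [k]) for k in ("finding_a", "finding_b")}
# Both findings added together.
joint = baseline - unexplained_gap(rows, ["finding_a", "finding_b"])
```

Here `sum(alone.values())` is twice `joint`, which is why each independently estimated share is read as an upper bound rather than as an additive component.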

The factors listed in Table 7 were assessed because they are factors that one might 
expect to vary by gender. Their relevance to possible causal theories for gender disparity is 
addressed in the Discussion below. Other than the factors analyzed here, sentencing fact- 
finding involves a vast array of context-specific inquiries. Likewise, other stated reasons for 
departures vary widely, and are often vague, such as "the interests of justice." 

3. Discussion 

The unexplained gender disparities identified above are large — much larger than 
those estimated via the prevailing method of conditioning on presumptive sentence. The key 
interpretive question is why these gaps exist — and, in particular, whether unobserved 
differences between men and women might justify them. One cannot instrument for inborn 
traits or manipulate them, so estimation of demographic disparities always risks omitted 
variables bias, and one must be cautious about inferring gender discrimination. Still, some 
often-advanced causal theories have testable implications. In this Part, I consider the leading 
theories suggested in the literature and in my informal conversations with criminal lawyers. 

3.1. Unobserved differences in offense severity. 

One obvious question is whether the crimes differ in ways not captured by the arrest 
offense codes. The arrest offense is not a perfect proxy for underlying criminal conduct, and 
if it overstates the severity of female conduct relative to that of men, that might explain some 
of the observed disparity. In particular, one might wonder whether the disparities introduced 
at sentencing fact-finding merely represent the process's proper accounting for nuanced 
differences in facts within offense categories, which is, after all, fact-finding's purpose. 

Unobserved differences naturally cannot be ruled out, but there are good reasons to 
doubt that they explain much of the observed disparity. First, the observable covariates are 
detailed, capturing considerable nuance. They include not just the 430 arrest codes and the 
multi-defendant flag (a proxy for group criminality, an important severity criterion), but also 
additional flags based on the written offense description (see Table 5, Columns 15-16). Second, 
the disparities are similar across all case types (and across arresting agencies), suggesting it is 
not a matter of a few crimes being "worse" when men commit them. Such differences would 
have to be prevalent across a variety of crimes and agencies to explain the result. 

Third, there is some reason to believe unobserved divergences between the arrest 
offense and actual criminal conduct may bias disparity estimates downward. If police tend to 
treat men more harshly, one might expect them to record arrest offenses that overstate men's 
culpability relative to women's. The empirical evidence on gender and policing is limited. 
Traffic stop studies reach divergent conclusions about whether there is bias against men 
(compare Rowe 2009 with Persico and Todd 2006), but at least do not suggest bias against 
women. A study covering a wider range of crimes (Stolzenberg and D'Alessio 2004) found 



This is almost surely the case with the fact-finding results in drug cases, where the shares reported in Table 7 
add up to slightly more than the total months of disparity attributed to fact-finding in Table 6. 

that, other factors equal, reported crimes with female offenders are substantially less likely to 
lead to arrests, a result the authors interpret as showing police leniency toward women. 

Nonetheless, there are some easily imaginable differences between male and female 
cases that might not be observed. For instance, men might well commit violent crimes with 
greater force, a difference not fully captured by the arrest code (beyond the labeling of some 
assaults as "aggravated"). There are fewer obvious potential differences in property, 
regulatory, or drug offenses, but perhaps women might commit smaller-scale offenses. Scale 
is captured to some degree by the arrest offense codes (for instance, pickpocketing versus 
vehicle theft), but not entirely — for instance, wire fraud could be in any amount. Findings of 
fact on loss value appear capable of explaining up to 20% of the otherwise-unexplained gap 
in non-drug crimes (Table 7). Unfortunately, there is no way to tell how much of that fact- 
finding difference reflects true underlying differences in the facts. 

With respect to drug quantity, the data are more informative. Drug quantity and type 
determine eligibility for mandatory minimums, which explain 29.5% of the post-arrest 
gender gap in drug cases (Table 6); related Guidelines adjustments can explain a further 3% 
(Table 7). For arrests before FY 2004, the drug quantity and type seized at arrest are 
recorded in the EOUSA investigation file. Within that pool, there are substantial gender 
disparities in the drug quantity found at the sentencing stage, even after controlling for drug 
quantity at arrest and the other standard covariates. The estimated gender gap in sentences 
in pre-2004 drug cases is only slightly reduced by adding arrest-stage drug quantity controls 
to the reweighting (Table 5, Cols. 22-23). These findings suggest that quantity findings at 
sentencing diverge from the underlying facts in ways that differ by gender. 

Another key factor affecting drug sentencing is the "safety valve" loophole built into 
the drug mandatory minimum statutes and the related Guidelines safety valve. The safety 
valves can explain up to 9% of the sentence gap in drug cases, and one might wonder 
whether this reflects "real" case differences. Eligibility for the safety valve is defined by 
statute, and cases can be coded as seemingly eligible or not based on the case's observed 
characteristics: criminal history, certain offense features, lack of aggravating role, and lack of 
obstruction. Conditional on apparent eligibility, women are significantly more likely to get 
safety-valve reductions. This is only suggestive evidence of disparate treatment, however, 
because the observables do not perfectly track the eligibility requirements. 

3.2. The "girlfriend theory." 

In group offenses, another factor affecting culpability is relative role. Women might 
be viewed as minor players, perhaps mere accessories of their male romantic partners. 
Prosecutors and judges may consider such women less dangerous, less morally culpable, or 
useful sources of testimony. While leniency may be appropriate in such cases (see Raeder 



Drug quantity findings drive both the application of mandatory minimums and the more nuanced gradations 
under the Guidelines. The 3% figure in Table 7 reflects only the latter component: the additional gender 
disparity explained by quantity findings after mandatory minimums had already been accounted for. 

The key source of discretion in safety valve application is the prosecutor's choice whether to characterize the 
defendant as having been fully truthful in describing the crime (see 18 U.S.C. 3553(f)). Beyond the absence of 
obstruction and the presence of acceptance-of-responsibility reductions, discussed above, the data do not 
provide a way to assess whether the defendant was in fact truthful. 

[2006]), some lawyers I spoke to suggested that such perceptions are not always justified by 
the facts; in cases involving couples, it may just be assumed that the female is the "follower." 

The data provide no way to test whether role perceptions are well founded, but they 
do suggest that they can partially explain the gender gap. Other than its implications for 
cooperation departures, the "girlfriend theory" has two testable implications: first, the gender 
gap should be larger in multi-defendant cases, and second, part of it should be attributable to 
sentencing adjustments for role in the offense. Both predictions are supported by the data. 
The gender gap is significantly larger in multi-defendant cases: 66% compared to 51% 
(Table 5). Approximately 14% of the otherwise-unexplained disparity in non-drug cases and 
20% in drug cases can potentially be explained by role adjustments (Table 7). The girlfriend 
theory appears to explain part, but not most, of the gender gap; it is hard for it to explain the 
large disparities that persist even in single-defendant cases. 

3.3. Parental responsibilities. 

Another possibility is that prosecutors and/or judges worry about the effect of 
maternal incarceration on children. The estimates are robust to controls for marital status and 
number of dependents, but these variables do not capture all differences in care 
responsibilities, including custody status. Other research shows that female defendants are far 
more likely than men to have primary or sole custody, and incarcerating women more often 
results in foster care placements (see Hagan and Dinovitzer [1999] for a review of the 
literature; Koban 1983). In an experiment asking judges to give hypothetical sentences based 
on short vignettes, Freiburger (2010) found that mentioning childcare reduced judges' 
probability of recommending prison, but mentioning financial support for children did not. 

The childcare theory suggests that one would expect to see the largest gender 
disparities among single parents, and the smallest among defendants with no children. That 
expectation is borne out by the data: compare Table 5, Columns 6-8. The TUT estimate is 
still over 50% among childless defendants, however, so the childcare theory appears not to 
fully explain the gender gap, but it probably explains part of it. 

On the other hand, the decompositions in Table 7 indicate that, at most, between 1% 
and 2% of the sentencing gap can be explained by disproportionate invocation of the official 
"family hardship" departure in the Sentencing Guidelines. Women in the sample receive that 
departure at three times the rate of men: 2.4% of cases versus 0.8%. But because the 
departure is so rare for both genders, it cannot explain much of the overall disparity. This is 
presumably because it requires "extraordinary circumstances," and judges typically hold that 
single parenthood does not suffice (see U.S.S.G. 5H1.6; Raeder 2006). Likewise, the main 
federal sentencing statute, 18 U.S.C. 3553, does not mention family hardship, and the 
Guidelines affirmatively instruct that family ties are "not ordinarily relevant." Federal 
sentencing law is not designed to provide much accommodation for defendants' children. 

In short, the family status-gender interaction appears to be more substantial than the 
one formal legal mechanism for accommodating family hardship can explain. Prosecutors 



The formal departure for duress or coercion (U.S.S.G. 5K2.12), while given to women at five times the rate 
of men (0.4% versus 0.08%), is far too rare to be a significant explanation for the gender gap. 

The gender gap is also slightly smaller in states with federal women's prisons (see Table 5, Columns 13-14), 
which may suggest that judges do not want to move women far from their families, although this is not a 
dramatic difference and other characteristics of those seven states might explain it. 

and/or judges seem to use their discretion to accommodate family circumstances in sub rosa 
ways — but not for male defendants. Among single men, conditional on observables, having 
children significantly increases sentences, and among married men, children make no 
significant difference. There are many competing arguments concerning whether family 
status is a proper sentencing consideration (see, for example, Markel, Collins, and Leib 
2007), and I will not address them here. However, if family hardship is a legitimate 
consideration, one might expect it to play at least some role in men's cases as well. 
Numerous studies have suggested that paternal incarceration harms children even when the 
father was already a noncustodial parent (see Hagan and Dinovitzer [1999] for a review). 

3.4. Cooperativeness. 

Another often-advanced theory is that female defendants receive leniency because 
they are more cooperative with the government. These data provide, at best, limited support 
for that theory. Conditional on observables, women are modestly but significantly more 
likely to receive downward departures for cooperation in another case (20% versus 17%), 
have higher guilty plea rates (97.5 vs. 96.2%), and have their cases resolved about two weeks 
sooner on average (a 10% difference). But the interpretation of these differences is not clear. 
Plea rates, timing, and cooperation are all endogenous, turning on the deals being offered. 
Moreover, women could be being rewarded more for the same level of cooperation; the 
actual assistance they provide is unobserved. On all four charge- and conviction-severity 
scales, women receive modestly but significantly larger charge reductions in plea-bargaining 
than men do, and far more favorable findings of fact, suggesting that they may be offered 
better factual stipulations. If women really are inherently more cooperative (or risk-averse), 
one might think prosecutors could get away with offering them lesser discounts, and still 
induce frequent guilty pleas. Yet the opposite appears to be true. 

Whatever the merits of these indicators of cooperativeness, they seem to explain only 
fairly modest portions of the gender gap. Adding a plea and elapsed-time indicator to the 
reweighting reduces the unexplained disparity by about 8% (Table 5, Col. 21). Disparities in 
departures for cooperation can explain up to 9% of the otherwise-unexplained gap in drug 
cases, but no significant share in non-drug cases (Table 7). In addition, the "acceptance of 
responsibility" reduction and the obstruction of justice enhancement do not explain any 
substantial portion of the gender gap; in non-drug cases these offset one another, while in 
drug cases neither is significant (Table 7). Unlike that of the family hardship departure, the 
limited explanatory power of these adjustments and departures cannot be attributed to rarity 
or tight legal constraints — all are very common. Formal mechanisms for recognizing 
women's purportedly greater cooperativeness are readily available, and yet they explain only 
a modest share of the disparity in drug cases and none in non-drug cases. 

3.5. Mental health, addiction, abuse, and other sympathetic life circumstances. 

Another theory is that female defendants may have more troubled life circumstances, 
such as poverty, mental illness, addiction, and abuse histories. If so, they may be perceived 
as less morally culpable or as candidates for rehabilitation. Criminal defendants often come 
from difficult backgrounds. This could well be disproportionately true for females; perhaps 
because women more rarely commit crime, those who do are likelier to be in the upper tail of 
the life-hardship distribution. Prisoner studies show more self-reported mental illness and 
prior abuse among women (see James and Glaze 2006; Harlow 1999). 

Socioeconomic status is not unobserved, however, and does not seem to explain the 
gender gap. The main specification includes education, and the results are robust to adding 
county-level socioeconomic controls and defense counsel type (a strong proxy for poverty). 
But mental health, addiction, and abuse are not observable unless judges cite them as the 
basis for a departure. The Guidelines permit departures for "unusual" mental and emotional 
conditions (U.S.S.G. 5H1.3) and for "significantly reduced mental capacity" (U.S.S.G. 
5K2.13). They prohibit departures for "disadvantaged upbringing" (U.S.S.G. 5H1.12) and in 
most cases for addiction (U.S.S.G. 5H1.4), although judges have more flexibility to disregard 
these restrictions after Booker. Together, all such cited bases for departures explain only 
between 1 and 2% of the otherwise-unexplained gap in sentence length; they are too rare to 
explain more. If prosecutors or judges take such factors into account in informal ways (as 
they seem to with family hardship, above), it would be unobservable. 

3.6. Race-Gender Interactions. 

Columns 11-12 of Table 5 show that the gender gap is substantially larger among 
black than non-black defendants (74% versus 51%). The race-gender interaction adds to our 
understanding of racial disparity: racial disparities among men significantly favor whites, 
but among women, the race gap in this sample is insignificant (and reversed in sign). The 
interaction also offers another theory for the gender gap: it might partly reflect a "black male 
effect" — a special harshness toward black men, who are by far the most incarcerated group in 
the U.S. This possibility is not really an "explanation" for the gender gap, much less a reason 
to worry less about it — but it might cause policymakers to understand it differently, as an 
issue of intersectional race-gender disparity. This theory only goes so far, however: the 
gender gap even among non-blacks is over 50%, far larger than the race gap among men. 

3.7. Gender discrimination: preference-based and statistical. 

Although several of the factors above appear to explain portions of the gender gap, 
that gap is large enough that it is plausible that gender discrimination also contributes. If so, 
several types of discrimination could be at play. The theoretical literature suggests 
"chivalry" and "paternalism" (see, for example, Franklin and Fearn [2008]). Another theory 
is selective sympathy: perhaps circumstances like family hardship or "bad influence" appear 
more sympathetic when it is women who are in them. Psychology experiments have found 
that attributions of blame and credit are often filtered through expectations that males are 
"agentic" and active and women are "communal" and passive (see Eagly, Wood, and 
Diekman [2000] for a review). If so, prosecutors or judges might more readily credit societal 
or situational explanations for females' crimes than for males'. 

Statistical discrimination is also possible. Perhaps the likeliest such mechanism is 
that prosecutors or judges might assume men are more dangerous than women. Studies 
generally find that women have lower recidivism rates, though some of the difference may be 
explained by characteristics that this study controls for (see Gendreau, Little, and Goggin 
[1996] for a meta-analysis). I do not have recidivism data to test whether statistical 
discrimination might be "rational" here. Note that if recidivism risk perceptions are based on 
individual information about the offender (not based on gender), then it is perfectly 
permissible to consider them. But punishment decisions based on statistical generalizations 
about men and women are unconstitutional. The Supreme Court has repeatedly ruled that 
reliance on gender stereotypes is impermissible even if those stereotypes are statistically well 
founded (see J.E.B. v. Alabama ex rel. T.B., 511 U.S. 127 [1994]). 

Rehavi and Starr (2012) explore these more extensively, finding a 10% unexplained disparity. 

Conclusion 

This study finds dramatic unexplained gender gaps in federal criminal cases. 
Conditional on arrest offense, criminal history, and other pre-charge observables, men 
receive 63% longer sentences on average than women do. Women are also significantly 
likelier to avoid charges and convictions, and twice as likely to avoid incarceration if 
convicted. There are large unexplained gaps across the sentence distribution, and across a 
wide variety of specifications, subsamples, and estimation strategies. The data cannot 
disentangle all possible causes of these gaps, but they do suggest that certain factors (such as 
childcare and offense roles) are partial but not complete explanations, even combined. 

These estimates are much larger than those of prior studies, which have probably 
substantially understated the sentence gap by filtering out the contribution of pre-sentencing 
discretionary decisions. In particular, this study highlights the key role of sentencing fact- 
finding, a prosecutor-dominated stage that existing disparity research ignores. Mandatory 
minimums, prosecutors' most powerful tools, are also important contributors to gender 
gaps in drug sentencing. Understanding the relative roles of prosecutors and judges is 
important. Gender disparities have been cited to support constraints on judicial discretion, 
including when the Sentencing Guidelines were adopted. But such constraints typically 
empower prosecutors, so if prosecutors drive disparities, they could backfire. 

Policymakers might simply be untroubled by leniency toward women. They are a 
small minority of defendants, and when disparities favor traditionally disempowered groups, 
they might raise fewer concerns. But the gender disparity issue need not be framed in terms 
of how women are treated. One could ask: why are men treated so harshly, if women are 
(apparently) treated otherwise? It is hard to dismiss this question as trivial: over two million 
American men are behind bars. While males generally are not a disadvantaged group, men 
in the criminal justice system generally are; they are mostly poor and disproportionately 
nonwhite. The especially high rate of incarceration of men of color is a serious social 
concern, and gender disparity is one of its key dimensions. 

From this perspective, one might think differently about some of the possible 
explanations for the gender gap. Most defendants of both genders have suffered serious 
hardship, have mental health or addiction issues, have minor children, and/or have 
"followed" others onto a criminal path. Sentencing law provides very limited formal 
mechanisms to account for such factors — which is probably why, with women, they appear 
to mostly be considered sub rosa. If prosecutors, judges, and legislators are comfortable with 
those factors playing a role in the sentencing of women, then perhaps it is worth explicitly 
reconsidering their place in criminal sentencing more generally. 
Table 1 

Summary Statistics 

                                            (1)       (2)       (3)       (4) 
                                            Mean      Female    Male      Observations 
                                                      Mean      Mean 

District court defendants sentenced for non-petty crimes: 
Male                                        0.808     0         1         231,694 
White                                       0.646     0.652     0.645     231,694 
Black                                       0.310     0.295     0.313     231,694 
Other Race                                  0.044     0.053     0.042     231,694 
Age (Years)                                 34.1      34.5      34.0      231,694 
U.S. Citizen (%)                            73.7      82.6      71.6      231,694 
Non-Parent                                  0.368     0.374     0.366     187,651 
Married Parent                              0.300     0.244     0.313     187,651 
Single Parent                               0.333     0.383     0.321     187,651 
Multi-Defendant Case                        0.473     0.472     0.473     231,694 
Education: 
HS Dropout                                  0.418     0.342     0.436     231,694 
HS Diploma                                  0.213     0.236     0.208     231,694 
GED/Vocational                              0.130     0.123     0.132     231,694 
College                                     0.239     0.300     0.224     231,694 
Criminal History: 
Category 1 (low)                            0.565     0.737     0.524     231,694 
Category 2                                  0.106     0.093     0.109     231,694 
Category 3                                  0.127     0.091     0.136     231,694 
Category 4                                  0.066     0.034     0.074     231,694 
Category 5                                  0.038     0.018     0.043     231,694 
Category 6 (high)                           0.097     0.028     0.114     231,694 
Offense Category: 
Property/Fraud                              0.282     0.468     0.237     231,694 
Regulatory                                  0.055     0.054     0.055     231,694 
Drug                                        0.590     0.446     0.625     231,694 
Violent                                     0.073     0.032     0.083     231,694 
Sentenced to Prison                         0.818     0.639     0.861     231,617 
Prison Sentence Length (Months)             56.9      25.2      64.4      231,617 
Prison Sentence Length (If Incarcerated)    69.5      39.5      74.8      161,032 

All arrestees in filing-stage sample: 
Filed in District Court                     0.919     0.905     0.922     386,205 

All district-court defendants in conviction-stage sample: 
Convicted (Non-Petty)                       0.928     0.913     0.932     286,709 
Table 2 

Regression Estimates of Mean Gender Disparities in Case Processing* 

                    (1)                 (2)                 (3)                 (4) 
                    Filing in District  Non-Petty           Incarceration       Log Prison Length 
                    Court               Conviction          (Odds Ratios)       (If Incarcerated) 
                    (Odds Ratios)       (Odds Ratios) 

                    Coefficient  SE     Coefficient  SE     Coefficient  SE     Coefficient  SE 
Male                1.213***     .044   1.293***     .029   2.[?]93***   .052   0.347***     .014 
Black               1.023        .045   0.919**      .025   0.909***     .023   0.063***     .012 
Other               1.544**      .201   0.928        .043   0.929        .050   0.0170       .029 
Age                 1.009***     .002   0.989***     .001   1.001        .001   0.0063***    .000 
U.S. citizen        1.480**      .215   1.061        .035   0.674***     .027   -0.037*      .016 
Multi-defendant                         0.680***     .020   1.115***     .031   0.158***     .017 
Ed. 2: HS Grad                                              0.864***     .020   -0.0205*     .008 
Ed. 3: GED                                                  0.902***     .026   0.0217**     .007 
Ed. 4: College                                              0.944*       .027   0.001        .008 
Crim. His. Cat. 2                                           2.165***     .070   0.261***     .015 
Crim. His. Cat. 3                                           3.525***     .124   0.364***     .015 
Crim. His. Cat. 4                                           7.336***     .370   0.511***     .016 
Crim. His. Cat. 5                                           11.573***    .820   0.650***     .017 
Crim. His. Cat. 6                                           19.424***    1.238  0.944***     .014 
N                   379,148             282,938             231,613             189,498 

Note. - Ed. = educational category; Crim. His. Cat. = criminal history category. Odds ratios/coefficients 
are from logistic and OLS regressions that also include arrest-offense and district fixed effects. One digit of 
the Column 3 "Male" odds ratio is illegible in the source and is shown as [?]. 
*Standard errors are clustered on arrest-district. *p<0.05; **p<0.01; ***p<0.001. 
Table 3 

Possible Effects of Sample Selection on Estimation of Disparity in Non-Zero Prison Sentences: 
Comparison of Trimmed-Sample Estimates* 

                (1)                 (2)                 (3)                 (4) 
                Coefficient  SE     Coefficient  SE     Coefficient  SE     Coefficient  SE 
Male            0.347***     .014   0.669***     .020   0.629***     .018   0.497***     .014 
Sample Trim     Untrimmed           Influence-Based     Shortest            Random Short 
N               189,498             166,586             166,586             166,586 

Note. - This table compares the "male" coefficient from Table 2, Column 4 to estimates for the same 
regression in samples that have male observations dropped so that the gender ratio in the trimmed sample 
matches the counterfactual ratio predicted by the Table 2, Column 3 regression if males were, conditional on 
observables, incarcerated only at the rate of women. The samples in Columns 2-4 are trimmed based on 
differing assumptions about which males are on the incarceration margin. Column 2 trims the male cases with 
the most negative individual influence on the "male" coefficient. Column 3 trims those with the shortest 
nonzero sentences, and Column 4 trims randomly from the male cases that have sentences no longer than 24 
months. 

*Standard errors are clustered on the offense-district. *p<0.05, **p<0.01, ***p<0.001 
Table 4 

Average Gender Disparities in Prison Sentence Length (Including Zeros): Inverse Propensity- 
Score Reweighting Estimates* 

            Average Treatment Effect              Treatment on Treated (Men)            Treatment on 
            (Treated=Male)                                                              Women 
            (1)         (2)         (3)           (4)         (5)         (6)           (7) 
Male        23.23***    17.60***    17.29***      25.13***    18.67***    18.55***      15.34*** 
            (2.716)     (1.923)     (1.373)       (2.908)     (1.936)     (1.409)       (1.701) 
Constant    36.58***    29.76***    27.85***      39.28***    31.57***    30.98***      25.20*** 
            (3.393)     (2.986)     (2.254)       (3.505)     (2.985)     (2.183)       (2.472) 
Percent     63.5        59.1        62.1          64.0        59.1        59.9          60.9 
Sample      Full        PS Trim     Low CH        Full        PS Trim     Low CH        Full 
N           231,582     190,535     173,407       231,582     190,535     184,787       231,582 

Note. - Columns 1-3 show the average increase in sentence in months associated with changing all cases 
from the female to the male treatment condition, estimated by inverse propensity-score reweighting. Covariates 
used to estimate propensity scores are arrest offense, criminal history, education, age, race, U.S. citizenship, and 
multi-defendant case flag. Column 1 shows full-sample results. The Column 2 sample is trimmed to eliminate 
extreme propensity score values (p(m)>.93), and the Column 3 sample is limited to cases in criminal history 
categories 1-3. For the same samples, Columns 4-6 show the "average treatment effect on the treated" (men) 
obtained by comparing the observed male average to the reweighted female average. Column 7 shows the 
counterfactual "average treatment effect on the untreated" (women) obtained by comparing the observed female 
average to the reweighted male average, for the full sample. The "constant" line is the average in the female 
treatment condition and the "percent" line expresses the treatment effect as a percent of this female average. 

*Standard errors are clustered on the strata within which propensity scores are balanced. *p<0.05, 
**p<0.01, ***p<0.001. 
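
The reweighting logic described in the note to Table 4 can be sketched in a few lines. The sketch below is illustrative only and is not the estimation code used in this study: it assumes that the propensity score is simply the male share within each covariate stratum (the study estimates it from the covariates listed in the note), and the function names and toy data are hypothetical. 

```python
from collections import defaultdict


def propensity_by_stratum(rows):
    """Estimate p(male | stratum) as the male share within each stratum.

    Assumption for illustration: a real analysis would fit a propensity
    model on the pre-charge covariates instead of using raw cell shares.
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_male, n_total]
    for male, stratum, _sentence in rows:
        counts[stratum][0] += male
        counts[stratum][1] += 1
    return {s: n_male / n_total for s, (n_male, n_total) in counts.items()}


def weighted_mean(weighted_values):
    """Mean of values supplied as (weight, value) pairs."""
    total_weight = sum(w for w, _ in weighted_values)
    return sum(w * v for w, v in weighted_values) / total_weight


def ipw_effects(rows):
    """Return (ATE, ATT, ATU) in months, where 'treated' means male.

    rows: iterable of (male, stratum, sentence_months), male in {0, 1}.
    Requires 0 < p < 1 in every stratum (common support).
    """
    rows = list(rows)
    p = propensity_by_stratum(rows)
    males = [(s, y) for m, s, y in rows if m == 1]
    females = [(s, y) for m, s, y in rows if m == 0]

    # ATE: reweight both groups to the pooled covariate distribution.
    ate = (weighted_mean([(1 / p[s], y) for s, y in males])
           - weighted_mean([(1 / (1 - p[s]), y) for s, y in females]))

    # ATT: observed male mean vs. females reweighted to resemble men.
    att = (weighted_mean([(1.0, y) for _, y in males])
           - weighted_mean([(p[s] / (1 - p[s]), y) for s, y in females]))

    # ATU: males reweighted to resemble women vs. observed female mean.
    atu = (weighted_mean([((1 - p[s]) / p[s], y) for s, y in males])
           - weighted_mean([(1.0, y) for _, y in females]))
    return ate, att, atu


# Toy data (hypothetical): (male, stratum, sentence in months).
toy = [(1, "A", 60), (1, "A", 60), (0, "A", 30),
       (1, "B", 20), (0, "B", 10), (0, "B", 10), (0, "B", 10)]
ate, att, atu = ipw_effects(toy)
```

In terms of Table 4, `ate` corresponds to the quantity in Columns 1-3, `att` to Columns 4-6, and `atu` to Column 7; the "percent" line divides each effect by the relevant female average. 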
Table 5 

Alternate Samples and Specifications: Inverse Propensity-Score Reweighting Estimates, 
Treatment on Untreated (Women)* 

            (1)           (2)           (3)           (4)           (5)           (6) 
Sample      Main          Drug          Prop./Reg.    Non-Parent    Married       Single 
                                                                    Parent        Parent 
Male        15.34***      23.35***      5.975***      12.82***      13.40***      17.63*** 
            (1.70)        (1.115)       (0.408)       (1.717)       (1.877)       (2.608) 
Constant    25.20***      40.00***      11.01***      24.85***      22.60***      27.26*** 
            (2.472)       (2.064)       (0.893)       (3.154)       (2.531)       (3.749) 
Percent     60.9          58.4          54.3          51.6          59.3          67.3 
N           231,582       136,730       77,989        68,890        56,085        62,419 

            (7)           (8)           (9)           (10)          (11)          (12) 
Sample      Pre-Booker    Post-Booker   Multi-        Sole          Black         Non-Black 
                                        Defendant     Defendant 
Male        14.65***      15.89***      21.42***      9.599***      17.52***      13.20*** 
            (1.855)       (1.961)       (2.421)       (1.553)       (2.645)       (1.087) 
Constant    23.81***      26.46***      32.43***      18.73***      23.68***      25.83*** 
            (2.554)       (3.067)       (2.90)        (2.99)        (3.80)        (1.947) 
Percent     61.5          60.1          66.0          51.2          74.0          51.1 
N           109,663       121,883       109,487       121,875       71,737        159,801 

            (13)          (14)          (15)          (16)          (17)          (18) 
Sample      States w/     States w/o    Police Notes  Police Notes  Family        Family 
            W. Pris.      W. Pris.      Rec'd         Flags         Rec'd         Added 
Male        14.45***      15.59***      15.75***      15.57***      15.07***      15.19*** 
            (1.626)       (2.149)       (1.665)       (1.606)       (1.842)       (1.407) 
Constant    25.79***      24.81***      26.44***      26.44***      25.22***      25.23*** 
            (2.78)        (2.96)        (2.71)        (2.78)        (2.777)       (2.339) 
Percent     56.0          62.8          59.6          58.9          59.8          60.2 
N           91,470        139,932       134,613       134,613       187,553       187,549 

            (19)          (20)          (21)          (22)          (23)          (24) 
Sample      Counsel       Counsel       Plea/Time     Drug Qty      Drug Qty      Presumpt. 
            Rec'd         Added         Added         Rec'd         Ctrl.         Sentence 
Male        15.33***      14.83***      14.06***      19.28***      17.83***      5.661*** 
            (1.531)       (1.351)       (1.607)       (1.943)       (1.720)       (0.748) 
Constant    26.70***      26.70***      25.20***      33.20***      33.20***      25.20*** 
            (2.224)       (2.521)       (2.523)       (2.060)       (2.372)       (4.218) 
Percent     57.4          55.5          55.8          58.1          53.7          22.5 
N           135,471       135,470       231,582       37,074        37,074        231,617 

Note. - The constant reflects the observed female average sentence length in months for the designated 
sample (including zeros) and the "male" coefficient is the average additional sentence length predicted if these 
cases were treated as male, based on inverse propensity score reweighting of the observed male sentences using 
the same covariates as in Table 4 except as noted. Standard errors are clustered on the strata within which 
propensity scores are balanced. *p<0.05, **p<0.01, ***p<0.001. 
Table 6 

Serial Decomposition of Average Gender Disparity by Procedural Sources: IPW Estimates of 
Treatment on Untreated (Women) 

Panel A: Non-Drug Cases (Observed Female Mean: 13.27 Months) 

                                  [1] No        [2] Pre-      [3] Add       [4] Add       [5] Add       Remainder 
                                  Controls      Charge        Charge Sev.   Conviction    Fact-finding  (Attrib. to 
                                  (Total Gap)   Controls      Measures      Sev. Measures               Sentencing) 
Unexplained Gap (Months)          26.90***      [illegible]   7.18***       6.88***       2.13***       N/A 
                                  (0.37)        (0.31)        (0.30)        (0.29)        (0.27) 
As % of Female Mean               202           59.5          54.1          51.8          16.1          N/A 
Share Explained by This Stage     N/A           19.01***      [illegible]   0.30***       [illegible]   [illegible] 
                                                (0.29)        (0.08)        (0.05)        (0.13)        (0.27) 
This Stage as % of Total Gap      N/A           70.7          2.6           1.1           17.7          7.9 
This Stage as % of Disparity      N/A           N/A           9.0           3.8           60.2          27.0 
in Justice Process 

Panel B: Drug Cases (Observed Female Mean: 40.04 Months) 

                                  [1] No        [2] Pre-      [3] Add       [4] Add       Remainder 
                                  Controls      Charge        Conviction    Fact-finding  (Attrib. to 
                                  (Total Gap)   Controls      Mand. Min.                  Sentencing) 
Unexplained Gap (Months)          38.92***      23.38***      15.57***      8.67***       N/A 
                                  (0.42)        (0.38)        (0.35)        (0.29) 
As % of Female Mean               97.2          58.4          38.9          21.7          N/A 
Share Explained by This Stage     N/A           15.54***      7.81***       [illegible]   8.67*** 
                                                (0.30)        (0.22)        (0.20)        (0.29) 
This Stage as % of Total Gap      N/A           39.9          20.1          17.7          22.3 
This Stage as % of Disparity      N/A           N/A           33.4          29.5          37.1 
in Justice Process 

Note. - Column 1 shows the average observed male-female sentence gap in months, while Column 2 shows 
the gap when males are reweighted on the inverse propensity score using the pre-charge covariates from Table 
4. In the other columns, additional covariates have been added sequentially. In Panel A, Column 3 adds the 
mandatory minimum, statutory maximum, and guidelines sentence associated with the initial charges; Column 4 
further adds the same measures for the charges of conviction; and Column 5 further adds the final Guidelines 
offense level. In Panel B, Column 3 adds the mandatory minimum at conviction, and Column 4 further adds the 
final offense level. The "Share Explained by This Stage" row is based on the reduction of the unexplained gap 
relative to the preceding step, and the rows below it express this share as a percentage of the total gap and the 
gap unexplained by pre-charge covariates. The last column in each panel ("Share Remaining") expresses the 
residual unexplained gap in the preceding column, which is attributed to the final sentencing decision, in percentage 
terms, showing that the percentages the decomposition attributes to the procedural stages sum to 100%. 
Cells marked [illegible] could not be recovered from the source. 
*Standard errors are bootstrapped (500 replications). *p<0.05, **p<0.01, ***p<0.001. 
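
The arithmetic behind the serial decomposition can be illustrated with the drug-case figures reported in Table 6. The sketch below is an illustration only, with a hypothetical function name and output layout; it takes the unexplained gaps as given and reproduces the stage shares and percentages, without re-estimating anything. 

```python
def serial_decomposition(total_gap, unexplained_after_step):
    """Split a total gap into stage shares from successive unexplained gaps.

    total_gap: raw gap with no controls (Column 1).
    unexplained_after_step: unexplained gap after each successive set of
        controls is added; the first entry uses pre-charge covariates only.
    Returns one dict per stage, plus a final remainder that the
    decomposition attributes to the sentencing decision.
    """
    pre_charge_gap = unexplained_after_step[0]
    levels = [total_gap] + list(unexplained_after_step)
    # Each stage's share is the drop in the unexplained gap at that step.
    shares = [prev - cur for prev, cur in zip(levels, levels[1:])]
    shares.append(levels[-1])  # remainder left after the last step

    result = []
    for i, share in enumerate(shares):
        result.append({
            "share_months": round(share, 2),
            "pct_of_total_gap": round(100 * share / total_gap, 1),
            # The pre-charge step is not part of the justice process itself.
            "pct_of_process_gap": (None if i == 0
                                   else round(100 * share / pre_charge_gap, 1)),
        })
    return result


# Drug cases: total gap, then the unexplained gap after pre-charge controls,
# the mandatory minimum at conviction, and fact-finding are added in turn.
drug = serial_decomposition(38.92, [23.38, 15.57, 8.67])
```

The four entries of `drug` correspond, in order, to pre-charge controls, the mandatory minimum, fact-finding, and the remainder attributed to sentencing. 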
Table 7 

Share of Mean Gender Gap Explained by Specific Findings of Fact and Departures: IPW 
Decomposition of Treatment on Untreated (Women)* 

                                  Non-Drug Crimes (Gap Unexpl.      Drug Crimes (Gap Unexpl. by 
                                  by Pre-Charge Controls: 7.9       Pre-Charge Controls: 23.4 
                                  Months)                           Months) 
                                  Months        Share of Gap (%)    Months        Share of Gap (%) 
Findings of Fact: 
Aggravating/Mitigating Role       [illegible]   [illegible]         [illegible]   [illegible] 
Acceptance of Responsibility      -0.261***     -3.3                0.039         0.2 
                                  (0.037)                           (0.094) 
Obstruction of Justice            0.228***      2.9                 0.076         0.3 
                                  (0.042)                           (0.070) 
Loss Amount                       1.585***      20.1                N/A           N/A 
                                  (0.065) 
Drug Quantity                     N/A           N/A                 0.740***      3.2 
                                                                    (0.103) 
Drug Safety Valves                N/A           N/A                 2.074***      8.9 
(Guidelines/Mand. Min. Waiver)                                      (0.111) 
Departures: 
Family Ties                       0.123***      1.6                 0.287***      1.2 
                                  (0.018)                           (0.032) 
Substantial Assistance/           0.069         0.9                 2.141***      9.2 
Cooperation                       (0.037)                           (0.108) 
Mental Health/Abuse/Addiction     0.136***      1.7                 0.235***      1.0 
                                  (0.024)                           (0.030) 

Note. - Entries are incremental reductions in unexplained disparity when particular findings of fact or departures are 
added to the IPW reweightings in Table 6. Findings of fact are added to weights that already include all 
variables through the conviction stage as noted in Table 6. Departures are added to weights that also included 
the final Guidelines offense level. Cells marked [illegible] could not be recovered from the source. 

*Standard errors are bootstrapped (500 replications). Because these figures are based on adding each of 
these variables independently (rather than together or sequentially), their collective explanatory power may be 
overstated if the variables are collinear with one another. *p<0.05, **p<0.01, ***p<0.001. 
FIGURES 

Figure 1a. - Distribution of Gender Propensity Scores for Full Sample 
[Figure not reproduced. Horizontal axis: Propensity Score. Series: Female, Male.] 

Figure 1b. - Distribution of Gender Propensity Scores for Low Criminal History Sample 

Figure 2. - Gender Disparities in the Sentencing Distribution: Females vs. Reweighted Males 

Figure 3a. - Sequential Reweighting of the Sentencing Distribution: Non-Drug Cases 
[Figure not reproduced. Horizontal axis: Sentence (Months), in bins NP, <12, 12 to 24, 24 to 36, 
36 to 60, 60 to 120, >120. Series: Observed Male, Pre-Chg Controls, + Chg Controls, + Conv Ctrls, 
+ Factfinding.] 

Figure 3b. - Decomposition of Gender Gaps in the Sentencing Distribution by Procedural 
Source: Non-Drug Cases 

Figure 3c. - Sequential Reweighting of the Sentencing Distribution: Drug Cases 

Figure 3d. - Decomposition of Gender Gaps in the Sentencing Distribution by Procedural 
Source: Drug Cases 

DATA APPENDIX 

1. The Linked Dataset 

This project is based on a linked, multi-agency dataset from four federal agencies: the 
U.S. Marshals' Service (USMS), the Executive Office for the U.S. Attorneys (EOUSA), the 
Administrative Office of the U.S. Courts (AOUSC) and the U.S. Sentencing Commission 
(USSC). These datasets are collected by the Bureau of Justice Statistics (BJS) and, 
pursuant to security conditions, made available to researchers via the National Archive of 
Criminal Justice Data along with linking files that allow records to be linked at an individual 
level across the agencies. Together, these files trace cases from arrest through sentencing. 

USMS collects information upon booking of arrestees in federal custody, based on 
arrest-stage information drawn from law enforcement. Its data include arrest offense, 
race, age, gender, marital status, a written offense description based on information from law 
enforcement, U.S. citizenship status, arrest date, the federal judicial district, and the arresting 
agency. EOUSA collects investigation- and case-related data for prosecutors; its fields were 
used to determine whether arrestees were charged before a district judge and for information 
on the type and quantity of drugs seized in arrests. Data on the initial and final charges in the 
case (and their disposition) as well as the number of co-defendants, defense counsel type, and 
the county of the offense came from the AOUSC, which compiles district court records. The 
USSC provides information recorded by judges on the sentence, the mandatory minimum 
applicable at sentencing, the defendant's criminal history, education level, number of 
dependents, and Hispanic ethnicity, as well as rich detail on the specific findings of 
"sentencing facts" entered by judges as well as the reasons given for departure from the 
Sentencing Guidelines range. 

The linking algorithm is dyadic and includes both inter- and intra-agency links, 
because EOUSA and AOUSC each have multiple kinds of files. There are two possible 
linking pathways that incorporate all the relevant fields. The first runs from USMS to the 
EOUSA suspect investigation file to the EOUSA "cases terminated" file to the AOUSC 
"cases terminated" file to the USSC. The second runs from USMS to the EOUSA suspect 
investigation file to the EOUSA "cases filed" file to the AOUSC "cases filed" file to the 
AOUSC "cases terminated" file to the USSC. The sample for the sentencing analysis is 
limited to cases that linked all the way through one of these two pathways. Link-through 
rates were 81% (USMS to EOUSA investigation files), 93% (EOUSA to AOUSC, among 
cases filed in district court only), and 90% (AOUSC to USSC, among cases with convictions 
of non-petty offenses only), and did not significantly differ by gender. 

The underlying linked dataset is the same as that used in a related paper on racial disparity by Rehavi and 
Starr (2012). However, the samples are constructed differently, in part because of different common-support 
concerns; this study uses more years of data, includes different case types, and includes all federal districts. 
This study also includes a number of additional covariates. 

Descriptions of the files are at http://www.icpsr.umich.edu/icpsrweb/content/NACJD/guides/fjsp.html. 

The lower link rate at the USMS-to-EOUSA stage is probably because there are substantive reasons cases might 
not link through, in addition to failures of the linking algorithm. Cases would not link through if they were 
immediately either declined or transferred to some other authority (before opening a suspect investigation file). 

2. Sample Restrictions 

2.1. Timing 

The main sentencing sample consisted of cases sentenced from October 1, 2000 
through September 30, 2009. The analyses of filing and conviction rates required case 
initiation (arrest or opening of the EOUSA investigation file, whichever is later) or 
disposition, respectively, during the same period. 

2.2. Case Type 

Immigration cases were excluded because their stakes typically center on deportation 
rather than sentencing and because they often are handled via a very different fast-track 
process. In order to achieve better overlap between the male and female samples, I also 
excluded several case categories in which the arrestees were over 95% male: sex and 
pornography-related offenses (except for prostitution), weapons offenses, conservation 
offenses (mainly illegal hunting and fishing), and family offenses (mainly failure to pay child 
or spousal support). The remaining case types were property and fraud offenses, regulatory 
offenses (excluding those mentioned above), non-sexual violent crimes, and drug offenses. 

All case type exclusions were based on the USMS arrest code. Defining the sample 
based on the arrest stage data alone avoided potentially serious sample selection issues that 
could have emerged had the exclusions been based on the prosecutor's discretionary 
decisions. The USMS codes are based on the principal arrest offense and may exclude some 
secondary criminal conduct (although in most cases, because concurrent sentencing is the 
default rule, secondary conduct will not affect the sentence). While virtually nobody in the 
sample was convicted of any immigration, sex/pornography, family, or conservation 
offenses, overlap between weapons cases and other cases is more common: about 6% of the 
sample was convicted of a weapons charge. The presence of weapons in violent crimes is 
often captured by the arrest code, and their presence in any kind of case is often captured by 
the police-notes-based description field that I use in robustness checks. 

Cases with arrest codes indicating a reason for detention other than a criminal offense 
(material witness warrants and violations of the conditions of parole or probation) were also 
excluded from the sample. 

3. Construction of Key Independent Variables 

3.1. Demographics 

Gender, race, U.S. citizenship, marital status, and age are recorded by USMS. Race is 
coded as white, black, Asian, Native, and Other/Unknown; the last three groups together 
constitute only 4% of the sample, and I combined them. USSC provides number of 
dependents, education level, and Hispanic ethnicity (ethnic Hispanics are overwhelmingly 
coded as white in the race data). Marital status, number of dependents, and Hispanic 
ethnicity are sometimes missing and are included only in robustness checks. Also as a 
robustness check, the county fields in AOUSC were linked to county-level unemployment, 
poverty, and wage data from the Census Bureau and to crime data from the FBI's Uniform 
Crime Reporting Program. 

Rates of filing in district court and conviction of non-petty offenses are outcomes assessed in the paper; cases 
that drop out of the sample due to non-filing, dismissal, or acquittal do not reflect linking failures. 

Citizenship was included as a covariate, and non-citizens were excluded in robustness checks, because 
deportation is also an important concern when non-citizens are charged even in non-immigration cases. 

3.2. Arrest Offense 

There are 430 unique arrest offenses listed in the USMS data. The original arrest 
offense codes included many very similar offense descriptions, including some that were 
slightly more detailed versions of others (for instance, "vehicle theft" and "vehicle theft by 
bailee"). Often the more detailed ones were rarely used. Therefore, the smallest categories 
were combined with others that could describe the same legal offense. In addition, I 
subdivided some of the drug arrest offense codes based on the drug type information in the 
EOUSA investigation-stage file. This is because many drug cases are simply given the arrest 
code "dangerous drugs," and because the cocaine arrest codes combine crack and powder, 
which have different sentencing schemes. There were 123 resulting arrest-code groups. The 
results are robust to the use of the original offense codes. 

Note that the drug offense codes do not specify quantity. The drug quantity at arrest 
(in addition to type) is usually identified by the EOUSA investigation-stage file; however, the 
quantity field is unreliable starting in FY 2004. Therefore, the main analyses do not 
include quantity in the controls, but robustness checks confined to FY 2001-03 do. To 
enable quantity comparisons across drug types, quantities were translated into "marijuana 
equivalents" according to the conversion tables in the Sentencing Guidelines. 
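
The conversion step can be sketched as a simple lookup. The sketch below is illustrative: the function names are hypothetical, and the conversion factors are placeholders chosen for the example, not values taken from this study. The factors actually applied would have to come from the Sentencing Guidelines conversion tables in force for the relevant year. 

```python
# Placeholder conversion factors (assumption): grams of marijuana treated
# as equivalent to one gram of each drug. Replace with the values in the
# applicable Sentencing Guidelines conversion table before any real use.
MARIJUANA_GRAMS_PER_GRAM = {
    "marijuana": 1.0,
    "cocaine_powder": 200.0,
    "heroin": 1000.0,
}


def to_marijuana_equivalent(drug_type, grams):
    """Convert a quantity of one drug into marijuana-equivalent grams."""
    if drug_type not in MARIJUANA_GRAMS_PER_GRAM:
        raise KeyError(f"no conversion factor for drug type: {drug_type}")
    return grams * MARIJUANA_GRAMS_PER_GRAM[drug_type]


def total_marijuana_equivalent(seizures):
    """Sum marijuana-equivalent grams over (drug_type, grams) pairs."""
    return sum(to_marijuana_equivalent(d, g) for d, g in seizures)


# Hypothetical arrest involving two drug types.
example_total = total_marijuana_equivalent([("heroin", 5), ("marijuana", 250)])
```

Putting all quantities on one scale is what allows a single quantity control to be used across arrest codes that cover different drugs. 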

3.3. Criminal History 

Criminal history data are only available in the USSC data and are accordingly only 
available for those sentenced for guideline offenses. The variable used was the defendant's 
criminal history category, which ranges from 1 to 6 and forms the basis of the Guidelines 
sentencing grid. In 0.2% of the sentencing sample, this field was originally missing but 
could be calculated based on another Sentencing Commission field called "criminal history 
points," according to the rules laid out in the Guidelines. 
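
The imputation of a missing criminal history category from criminal history points can be sketched as follows. This is a minimal illustration that assumes the point ranges of the Guidelines sentencing table (0-1 points for Category 1 up through 13 or more points for Category 6); the record layout and field names are hypothetical and are not the USSC variable names. 

```python
# Upper bound of each point range and the category it maps to, assuming
# the ranges of the Guidelines sentencing table; 13+ points is Category 6.
_POINT_CUTOFFS = [(1, 1), (3, 2), (6, 3), (9, 4), (12, 5)]


def category_from_points(points):
    """Map criminal history points to a criminal history category (1-6)."""
    if points < 0:
        raise ValueError("criminal history points cannot be negative")
    for upper_bound, category in _POINT_CUTOFFS:
        if points <= upper_bound:
            return category
    return 6


def fill_missing_category(record):
    """Return the recorded category, or derive it from points if missing.

    record: dict with hypothetical keys 'crim_hist_cat' and
    'crim_hist_points'; either may be None.
    """
    if record.get("crim_hist_cat") is not None:
        return record["crim_hist_cat"]
    if record.get("crim_hist_points") is not None:
        return category_from_points(record["crim_hist_points"])
    return None  # neither field is available


filled = fill_missing_category({"crim_hist_cat": None, "crim_hist_points": 7})
```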

3.4. Charge Severity Measures 

The raw charge and conviction data are the statutory provisions under which the 
defendant was charged and convicted. These provisions had to be translated into measures of 
charge severity in order to assess the contribution of charging and conviction severity to 
sentence disparities. On the basis of legal research, I identified the statutory maximum and 
minimum sentence, and the Guidelines-recommended sentence associated with each 
combination of charges and convictions. 

No single number defined what categories were small enough to be combined, because the combination 
depended on the legal assessment that the crimes were sufficiently similar. 

There are drastic changes in the apparent quantity distribution in this field from 2003 to 2004 as well as large 
inconsistencies in quantity between this field and the sentencing-stage quantities recorded by USSC beginning 
in 2004. EOUSA adopted a new data entry system in 2004, and it seems apparent that the problem is with this 
system; unfortunately the inconsistencies appeared neither to be uniformly applicable nor confined to particular 
drug types or districts, so there is no way to identify which cases are problematic. 

While the AOUSC data include a "severity" field, which is ostensibly based on the statutory maximum, it is 
not very useful because it appears to be based automatically on the very highest maximum contained anywhere in 
the statute cited, even when that maximum is only triggered by an exceptional circumstance that rarely applies. 
For instance, charges under 18 U.S.C. § 1347 (health care fraud) are coded by AOUSC as having a statutory 
maximum of life, even though that maximum only applies when the fraud leads to a death. 

Because the cited statutory provisions sometimes contain varied sentencing schemes 
depending on the facts of the case, I researched the most common ways in which these 
statutes are charged in order to be able to make realistic assumptions in the face of such 
ambiguities. When possible, ambiguities were resolved by reference to the other charges in 
the case, when the legal elements of those charges revealed additional facts that the 
prosecutor must have been alleging. For instance, suppose Charge 1 is a burglary offense 
that usually has a maximum sentence of 10 years, but has a 20-year maximum if someone is 
seriously injured in the course of the burglary. Charge 2 is an aggravated assault charge, 
with a 15-year maximum, in which aggravated assault is defined to require that serious injury 
be proven. Because Charge 2's presence indicates that the prosecutor was alleging serious 
injury, the maximum sentence for Charge 1 is raised to 20 years. 
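
The burglary example above can be sketched in code. This is only an illustrative sketch, not the coding scheme actually used: the `Charge` class, the flag name `serious_injury`, and the field names are hypothetical, and maximums are expressed in months.

```python
from dataclasses import dataclass, field


@dataclass
class Charge:
    name: str
    base_max_months: int
    # Facts that are legal elements of this offense (hypothetical flag names).
    element_flags: frozenset = frozenset()
    # Enhanced maximums that apply when a given fact is alleged in the case.
    enhanced_max_months: dict = field(default_factory=dict)


def resolve_statutory_max(charge, all_charges):
    """Return the charge's maximum, using facts implied by the other charges."""
    alleged = set()
    for other in all_charges:
        if other is not charge:
            alleged |= other.element_flags
    applicable = [
        months
        for flag, months in charge.enhanced_max_months.items()
        if flag in alleged
    ]
    return max([charge.base_max_months] + applicable)


# Charge 1: burglary, 10-year maximum, 20 years if serious injury is alleged.
burglary = Charge("burglary", 120, enhanced_max_months={"serious_injury": 240})
# Charge 2: aggravated assault, 15-year maximum, serious injury is an element.
assault = Charge(
    "aggravated assault", 180, element_flags=frozenset({"serious_injury"})
)
case = [burglary, assault]
burglary_max = resolve_statutory_max(burglary, case)  # 240 months (20 years)
```

Charged alone, the same burglary count would keep its 120-month maximum, since no other charge supplies the serious-injury fact.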

Implementing this approach required constructing a number of flags for every federal 
criminal statute, a complicated statutory interpretation task. The flags indicated whether 
certain facts were elements of the crime: death, injury, serious injury, drug crime, sex crime, 
fraud, official victim, minor victim, terrorist motive, an assault, use of a weapon, use of a gun 
specifically, a "crime of violence," obstruction of justice, taking a person for ransom, and 
whether the crime was a predicate offense for the crime of felony murder. Statutes also had 
to be coded to reflect adjustments to the statutory or guidelines sentences that would be 
triggered by the presence of particular facts as identified by the flags for the other charges in 
the case. Remaining ambiguities were resolved according to default assumptions that varied 
between the severity measures. 

Constructing a measure of the Guidelines sentence involved additional challenges. 
First, the statutory provisions cited by AOUSC had to be matched to corresponding 
Sentencing Guidelines. The actual Guidelines range calculated by the judge is not solely 
determined by the charges; rather, it is heavily driven by sentencing fact-finding. However, 
the point of the charge-severity measures is to distinguish the effect of charging and 
conviction severity itself from that of sentencing fact-finding. Thus, the Guidelines-based 
measures of charge and conviction severity represent base offense levels determined solely 
by what the prosecutor charged (or what the defendant was convicted of), that is, the 
elements of the crime. They are based on applying the Guidelines assuming the elements of all 
charges brought were proven, but no additional findings of fact were made at sentencing. 

The Guidelines define the "offense level" — a severity scale running from 1 to 43 — 
associated with each offense. In order for the units of this measure to be comparable to the 
other metrics, this offense level had to be converted into an implied sentence length in 
months. Under the Guidelines, offense levels translate mechanically into sentence ranges 
based on a grid, with criminal history as the other axis. The same column (Column 6) was 
used for the translation in all cases, such that the charging and conviction measures are blind 
to the defendant's actual criminal history: they reflect charge severity alone, and criminal 
history is a separate covariate. The number of months used was the low end of the range in 
the applicable grid cell. 



A detailed spreadsheet showing these flags and other details on coding choices is available on request. 

Once the severity of the individual charges was coded, they were combined into total 
severity measures for all charges. In general, the severity of federal cases is determined by 
the most serious charge alone, because concurrent sentencing is the default rule. Thus, 
secondary charges affected the charge severity measures only when one of the charging 
statutes was an offense specifying that consecutive sentencing was required. As described 
above, however, information drawn from secondary charges could be used to adjust the 
coding of the primary charge. This approach to combining charges follows the method 
specified in the Sentencing Guidelines (see U.S.S.G. § 5G1.2). 
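
A minimal sketch of this combination rule follows. It assumes that each charge has already been reduced to a severity in months plus a hypothetical `requires_consecutive` flag, and it assumes that consecutive-sentence charges simply add to the most serious concurrent charge; the actual coding may handle such statutes in more detail.

```python
def combine_severity(charges):
    """Combine per-charge severities into a case-level severity.

    charges: list of (severity_months, requires_consecutive) tuples.
    Concurrent charges contribute only their maximum; charges whose statute
    mandates consecutive sentencing are added on top (an assumption here).
    """
    concurrent = [months for months, consec in charges if not consec]
    consecutive = [months for months, consec in charges if consec]
    return max(concurrent, default=0) + sum(consecutive)


# Two concurrent charges: only the more serious one counts.
concurrent_only = combine_severity([(120, False), (60, False)])
# Same primary charge plus a statute requiring a consecutive 60 months.
with_consecutive = combine_severity([(120, False), (60, True)])
```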

Two final adjustments were then made. First, the statutory minimum and the sum of 
the individual-charge maximums were imposed as lower and upper constraints, respectively, 
on the Guidelines sentence, which also tracks sentencing law (see U.S.S.G. § 5G1.2). 
Second, zeros on the statutory maximum, guidelines, and mean sentence scales were replaced 
with half a month (half of the lowest nonzero values otherwise calculated) to reflect the 
fact that no criminal charge truly has zero severity, even if no incarceration is imposed. This 
adjustment affected only 0.05% of cases for the statutory maximum measure, 0.2% of cases 
for the guidelines measure, and 0.5% for the mean sentence measure. 
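
The two adjustments can be sketched as follows. The function and argument names are hypothetical, all quantities are in months, and the sketch assumes the statutory minimum never exceeds the summed maximums.

```python
def constrain_guidelines(guidelines_months, statutory_min, charge_maxes):
    """Bound the Guidelines measure below by the statutory minimum and
    above by the sum of the individual-charge statutory maximums."""
    upper = sum(charge_maxes)
    return min(max(guidelines_months, statutory_min), upper)


def replace_zero(severity_months, floor=0.5):
    """Replace a zero severity with half a month; leave other values as is."""
    return floor if severity_months == 0 else severity_months


raised = constrain_guidelines(30, 60, [120, 240])   # lifted to the 60-month minimum
capped = constrain_guidelines(400, 0, [120, 240])   # capped at 120 + 240 = 360
floored = replace_zero(0)                           # 0.5
```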

The mandatory minimum measure was turned into an indicator variable for whether 
there was any nonzero mandatory minimum and (for alternate specifications) a categorical 
variable designating whether the mandatory minimum was 0, less than 10 years, or 10 years 
or more. Similar variables were constructed based on the actual mandatory minimum of 
conviction recorded at the sentencing stage in the USSC data. 
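
As an illustration, the two derived variables might be built as below. The category labels are hypothetical names chosen for this sketch, and the mandatory minimum is assumed to be recorded in months.

```python
def mandatory_min_indicator(min_months):
    """1 if any nonzero mandatory minimum applies, else 0."""
    return int(min_months > 0)


def mandatory_min_category(min_months):
    """Three-way classification: none, under 10 years, or 10 years or more."""
    if min_months == 0:
        return "none"
    if min_months < 120:  # 10 years = 120 months
        return "under_10_years"
    return "10_years_or_more"


five_year = (mandatory_min_indicator(60), mandatory_min_category(60))
ten_year = (mandatory_min_indicator(120), mandatory_min_category(120))
```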

3.5. Conviction and Sentence Outcomes 

A dummy variable for whether the defendant was convicted of a non-petty offense 
was constructed based on AOUSC records. Non-petty offenses are those carrying more than 
six months as a statutory maximum, so the classification of offenses is based on the statutory 
maximum measure described above. Conviction of a non-petty offense is a prerequisite for 
inclusion in the Sentencing Commission data. 

Sentence data were drawn from the Sentencing Commission and are therefore only 
available for those convicted of offenses covered by the sentencing guidelines. Sentence 
lengths were truncated at 540 months, and life sentences were given that value. This length 
is longer than the highest non-life statutory maximum found in federal law (480 months), and 
corresponds approximately to the remaining life expectancy of an American of the sample- 
average age. Only 0.7% of sentenced cases were affected by this truncation. 
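
The truncation rule amounts to the following sketch, in which the `is_life` flag and the function name are hypothetical stand-ins for however life sentences are recorded in the underlying data.

```python
CAP_MONTHS = 540  # truncation point described in the text


def capped_sentence(months, is_life=False):
    """Return the sentence length in months, top-coded at 540.

    Life sentences are assigned the cap regardless of any recorded length.
    """
    if is_life:
        return CAP_MONTHS
    return min(months, CAP_MONTHS)


ordinary = capped_sentence(120)                 # unchanged
long_term = capped_sentence(600)                # truncated to 540
life = capped_sentence(None, is_life=True)      # assigned 540
```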